Remarks

Regarding amendments in the Claims: 

Claims 6-9 and 15-19 were allowed in the Final Office Action mailed May 13, 2005. Claims 10-14 and 
20-24 were rejected in that Final Office Action. In response to the rejection, applicants have amended 
claims 10, 20, and 24. Clarification regarding claims 14 and 24 is also respectfully submitted.

Applicants have previously submitted Remarks (or Arguments) for patentability of presently pending 
claims in an Amendment/Response filed December 23, 2004 and a Supplemental
Amendment/Response filed January 26, 2005. And applicants respectfully direct the Examiner to these
Remarks (including the "Stamp Pasting Analogy" and Illustration on pages 18 and 19 of the 
Supplemental Amendment) and will, in general, not repeat those Remarks here. As is well known, 
support for claims need not be verbatim ("ipsis verbis" or "in haec verba"), but only described in 
sufficient detail that one skilled in the art can reasonably conclude that the inventor had possession of 
the claimed invention (see, e.g., Vas-Cath, Inc. v. Mahurkar, 935 F.2d at 1563, 19 USPQ2d at 1116). 

Regarding claims 10 and 20 (and their dependent claims): The Examiner has rejected these claims
under 35 U.S.C. 112, second paragraph as being indefinite because of the language "nearly identical" 
and "described in (1): (1) any one CL-F point..". 

Applicants have amended claims 10 and 20 and respectfully submit that the scope of these newly 
amended claims is not decreased. The limitation containing the language "nearly identical" 
accomplishes the elimination of pairs of redundant markers in the same subset (that provide the same 
information). Applicants have responded and amended claims 10 and 20 by deleting the limitation 
containing the language "nearly identical". See [0321], which states that Step 3, the elimination of pairs 
of redundant markers in the same subset (that provide the same information), is not essential. 
Therefore, the deletion of the limitation is supported. 

The language "described in (1): (1) Any one CL-F point.." has also been deleted from claims. Applicants 
have simplified and clarified the language in claims 10 and 20. 
The CL-F region in claims 10 and 20 is now specified to be a segment-subrange, wherein the segment
of the segment-subrange is the region of interest or the chromosome. (See [0275] bottom of page 20 
which states that the length of a chromosomal segment can be as long as a chromosome.) Support for 
the limitations or language in claims 10 and 20 has been given in the Supplemental Amendment of 
January 26, 2005. Specifically p. 13 of that Supplemental Amendment states that any CL-F region (any 
collection of one or more points [0050]) is an example of a CL-F region that is systematically covered by 
versions of the invention. A segment-subrange is an example of such a CL-F region, see [0090]; and 
see [0185] "Specific types of CL-F regions that are N covered are useful. For example, a
rectangular CL-F region, a segment-subrange,...". 

And see the top of p. 16 of the Supplemental Amendment of Jan. 2005, which gives support for the
limitation "wherein each point in the CL-F region is N-covered to within [L, y] by markers belonging to a
subset, L is the length of the longest segment, y is 0.15 and N≥2,". This limitation follows directly from
the facts that the markers in each subset belong to only one segment (whose maximum length is L), the 
fact that the difference between the least common allele frequencies of any two subset markers does 
not exceed 0.15, and that there are two or more markers in each subset. See also the "Stamp Pasting 
Analogy" and Illustration on p. 16, 18 and 19 of the Supplemental Amendment of Jan. 2005. 
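
By way of illustration only, the three facts relied on above can be checked mechanically for a given subset of markers. The following Python sketch is not taken from the specification or the claims. The names `Marker` and `subset_meets_conditions`, the position units, and the example values are hypothetical and are introduced solely for this illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Marker:
    position: float         # location on the chromosome (hypothetical units)
    min_allele_freq: float  # least common allele frequency of the marker


def subset_meets_conditions(markers: Sequence[Marker],
                            seg_start: float,
                            seg_end: float,
                            max_freq_diff: float = 0.15,
                            min_markers: int = 2) -> bool:
    """Check the three stated facts for one subset of markers."""
    # Two or more markers must be present in the subset.
    if len(markers) < min_markers:
        return False
    # Every marker must lie in the same single segment.
    if any(not (seg_start <= m.position <= seg_end) for m in markers):
        return False
    # The least common allele frequencies of any two markers may differ
    # by at most max_freq_diff; this is equivalent to max - min <= limit.
    freqs = [m.min_allele_freq for m in markers]
    return max(freqs) - min(freqs) <= max_freq_diff


subset = [Marker(10.0, 0.20), Marker(12.5, 0.30)]
print(subset_meets_conditions(subset, seg_start=0.0, seg_end=20.0))
```

The sketch tests only the three stated conditions; it does not attempt to model N-coverage of every point in a CL-F region.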

Regarding claims 14 and 24: The Examiner has rejected these claims as indefinite under 35 U.S.C.
112, second paragraph because of the language "thousands of bi-allelic covering markers". Applicants 
respectfully offer the following clarification. The concept of "thousands of bi-allelic [covering] markers" (in 
connection with the physical implementation of the new, two-dimensional linkage study techniques of this
application) using silicon chips or glass slides containing oligonucleotides is described in [0322], [0323], 
and [0324]. Included in this description is the paper cited in endnote 8, that is incorporated by reference 
into the application (Accessing Genetic Information with High-Density DNA Arrays. Mark Chee, et al.
Science, vol 274, Oct. 25, 1996, pp. 610-614). Other similar papers such as Large Scale Identification,
Mapping, and Genotyping of Single-Nucleotide Polymorphisms in the Human Genome, Wang, et al.,
Science, May 15, 1998, vol 280, pp. 1077-1081 in endnote 9, are also incorporated by reference into
the application. 
Applicants respectfully submit that given this description and the knowledge of one of ordinary skill, the
limitation "thousands of bi-allelic covering markers" is definite. "Thousands" is the plural of "thousand"
meaning literally 2000 or more. Thousands is also a very large number. (See enclosed copy of page
1228 of the Tenth Edition of Merriam-Webster's Collegiate Dictionary giving the definition of "thousands"
as plural of "thousand".) The abstract of the Chee paper (endnote 8) describes "DNA arrays containing
up to 135,000 probes" and page 610 of this paper (bottom left hand column) describes an "array of a
large number of oligonucleotide probes". And the next to last paragraph of the Wang paper (endnote 9) 
on page 1081 describes a "2000-SNP genotyping chip". 

The disadvantages of the conventional, generally slower, nucleic acid sequencing technologies
(compared to high-density DNA arrays that use large numbers of oligonucleotide probes) are described in
the first column, page 610 of the Chee paper. Both the Chee and Wang papers describe querying the
entire human genome (estimated in Chee at 100,000 genes) using a high-density array (p. 613 Chee
and last two paragraphs p. 1081 Wang). No definite upper limit to the number of probes (and by
implication number of markers) for the technology is given. It is believed that "For example, the entire
set of ~10^12 20-nucleotide oligomer probes, or any desired subset, can be synthesized... The number of
probes that can be synthesized is limited only by the physical size of the array and the achievable
lithographic resolution," (first paragraph, right hand column p. 610 Chee).

The expressions "thousands of genes", "thousands of oligonucleotides", "thousands of bi-allelic 
markers" or similar expressions were used in the art at the time of filing of the application and are still 
being used. These expressions are often used in connection with high-density DNA arrays and specific 
example numbers (that are in the thousands). The applicants will cite several examples of this usage in 
the art below. 
Given the extensive usage of these expressions and knowledge in the art, applicants respectfully submit
the limitation "thousands of bi-allelic covering markers" is definite. In In re Corr (146 USPQ 69), the 
Court found that the phrase "high styrene resin" was definite and rejected the argument the phrase 
represented "undue breadth or overclaiming". The Court noted that the specification stated that the 
"high styrene resin" was a resin such as PLIOLITE S-6B. And the Court stated: "Appellant's specification 
taken with the prior art clearly indicates that the styrene resin component of his composition is 
conventional and many equivalents are known in the art" (146 USPQ at 71). Applicants respectfully 
submit that in the present application (as in In re Corr) examples of "thousands" have been given (i.e. 
135,000 probes, 100,000 genes in the Chee paper; 2000-SNP genotyping chip in the Wang paper).
And numerous equivalents of these examples of "thousands" were known in the art at the time of filing.
Applicants will cite evidence that there were such numerous equivalents of "thousands" known in the art
in the following two paragraphs. 

Applicants respectfully direct the Examiner's attention to the last paragraph on p. 772 of the Fodor paper
(Science, 1991, vol. 251, pp. 767-773). This paragraph describes a high-density array with 65,536
oligonucleotides. The Fodor paper is cited as a reference in the Chee paper (note 5, p. 613). The Cann
paper (C R Acad Sci III June 1998; 321(6):443-6) uses the phrase "thousands of DNA polymorphisms
(genetic markers)" and "thousands of more stable single nucleotide polymorphisms that detect variation
on average once every ~1000 base pairs" (see Abstract p. 443 and p. 445 left column bottom
paragraph). The DeRisi paper (Science vol 278 October 1997 pp. 680-686) uses the phrase "DNA
microarrays, consisting of thousands of individual gene sequences printed in a high-density array on a
glass slide" (p. 680 2nd paragraph left most column). The DeRisi paper also describes the amplification
of 6000 genes and microarrays with 6400 elements in each array (see p. 685 last paragraph). The
Lashkari paper (Proc Natl Acad Sci USA vol 94, pp. 13057-13062 Nov. 1997) describes high density
DNA arrays containing 2,479 yeast ORFs and 6,100 ORFs (see Abstract p. 13057 and next to last
paragraph p. 13062). The Johnston paper (Current Biology Feb 26, 1998, 8(5) pp. R171-R174) uses the
phrases "thousands of genes" and "thousands of DNA fragments" in connection with high-density DNA
arrays. And Johnston also describes "current oligonucleotide chips display all 6000 yeast genes on four
1.28 x 1.28 cm chips, or 1.8 x 1.8 cm glass slide" (see Abstract and second paragraph p. R171).

A Nature Genetics Supplement vol 21 January 1999 has a large number of papers on DNA microarrays. 
For example, the Brown paper describes "arrays of thousands of discrete DNA sequences (for example, 
all 6200 known and predicted genes of S. cerevisiae" (see p. 33 last paragraph). And the Lipshutz paper
describes hundreds of thousands of oligonucleotides in an array and gives a specific number example 
of approximately 300,000 (see Abstract and second paragraph p. 20). 
Copies of cited pages in the above papers (rather than all the pages) are included for the Examiner's
convenience. These are: Chee pp. 610 and 613, Wang pp. 1077 and 1081, Fodor pp. 767 and 772, 
Cann pp. 443 and 445, DeRisi pp. 680 and 685, Lashkari pp. 13057 and 13062, Johnston p. R171, 
Brown p. 33 and Lipshutz p. 20. 

As stated above, the phrase "thousands of bi-allelic markers" is included under Physical Implementation 
[0322] of the new, two-dimensional linkage study techniques of this application. Thousands of bi-allelic 
markers are thus described as a tool or implement for use by two-dimensional linkage study techniques. 
Indeed at the time the application was filed, the whole field of association studies was looking to use
thousands of bi-allelic markers. See for example Risch, N. and Merikangas, K.: The Future of Genetic 
Studies of Complex Human Diseases. Science, 13 September 1996, vol. 273, pp. 1516-1517 cited in 
[0027] of the application. This Risch paper (see p. 1517 mid left most column) describes using 
technological advances to do association testing of five diallelic (or bi-allelic) polymorphisms within each 
of 100,000 genes (a total of 500,000 polymorphisms tested in the association study). And the
inventor's paper is a generalization of the Risch and Merikangas analysis [0029]. A copy of the Risch 
paper is included herewith. 

Thus the use of thousands of bi-allelic covering markers as recited in claims 14 and 24 is supported and 
is definite. Even if the process of claim 14 used, for example, 2 million covering markers, it would 
necessarily use thousands of covering markers and be included in the scope of the claim. And even if 
there were for example, 2 million covering markers in the group of two or more bi-allelic covering 
markers as recited in claim 24, there would necessarily be thousands of covering markers in the group. 
And such an embodiment would be within the scope of claim 24. 

Regarding new claim 25: Newly added claim 25 contains the language "nearly identical" which caused
the Examiner to reject claim 10 for lack of definiteness. Newly added claim 25 deals with redundancy of 
markers and makes use of description recited in [0315], [0316] and [0317]. Similar description is found 
in [0268] to [0271]. Applicants respectfully submit that new claim 25 is definite. Specifically when 
markers are redundant and are in extreme positive linkage disequilibrium then every chromosome in the 
population that carries allele A also carries allele B and every chromosome that carries not allele A also
carries not allele B, or this is nearly the situation [0316]. Under these circumstances the genotype of an 
individual at one marker will almost always predict the genotype of the individual at the other marker. 
Similarly allele frequency for an allele at one marker for a sample will predict with a very high degree of 
certainty or precision the allele frequency for an allele at the other marker. 
Though there is relative language in claim 25, this relative language does not render the claim
indefinite. As stated in the MPEP 2173.05(b) "The fact that claim language, including terms of degree,
may not be precise, does not automatically render the claim indefinite under 35 USC 112, second
paragraph. (Seattle Box Co. v. Industrial Crating & Packing, Inc. 221 USPQ 568). Acceptability of the 
claim language depends on whether one of ordinary skill in the art would understand what is claimed in 
light of the specification." Applicants respectfully submit that the language used in claim 25 is as precise 
or accurate as the subject matter allows. It is not possible to reasonably specify parameters such as 
allele frequency differences or departures from maximal linkage disequilibrium between redundant 
markers with a greater degree of accuracy or precision. A clear result or effect of redundancy of markers 
(and the relative difference of "nearly identical information") is specified in the claim, specifically that 
there would be no increase in the likelihood of detecting linkage. The limitation is as follows: "wherein 
the inclusion of a bi-allelic marker in the subset so that there would be a redundant pair in the subset 
would not increase the likelihood of detecting linkage and association of the trait-causing 
polymorphism". Applicants respectfully submit the situation is similar to that in Orthokinetics, Inc. v. 
Safety Chairs, Inc. (1 USPQ 2d 1081), which is cited in MPEP 2173.05(b). In that case, the Court found 
that a claim to a chair "so dimensioned" as to fit between an automobile doorframe and one of the seats 
was definite. The Court said the phrase "so dimensioned" is as accurate or precise as the subject matter 
permits. 

Similarly applicants respectfully submit that a redundant marker pair is defined in terms of what it does. 
The markers of the pair provide nearly identical information, and so the addition of one of the markers 
does not increase the likelihood of detecting linkage. This is similar to the functional limitations
described in MPEP 2173.05(g), which were found definite in In re Barr (170 USPQ 33) and In re
Venezia (189 USPQ 149).

Claims 12 and 22 have also been amended to bring them into harmony with the language in claims 
from which they depend. Their scope is unchanged. 



RCE, Amendment/Response for application 10/037,718 Applicants MCGINNIS ET AL 14
September 13, 2005 

Conclusion 

An RCE has been filed and claims 10, 20 and 24 have been amended in response to the Examiner's 
rejection and new claim 25 has been added. Claims 12 and 22 have also been amended. 
Remarks/Arguments in this Response have addressed each point of rejection in the Final Office Action. 

For the reasons advanced above, applicants respectfully submit that the application is now in condition 
for allowance and that action is earnestly solicited. 

Respectfully submitted, 

Robert O. McGinnis 
Registration No. 44,232

September 13, 2005 
1575 West Kagy Blvd. 
Bozeman, MT. 59715 
tel (406)-522-9355 



The Future of Genetic Studies of 
Complex Human Diseases 

Neil Risch and Kathleen Merikangas 



Geneticists have made substantial progress in 
identifying the genetic basis of many human 
diseases, at least those with conspicuous deter- 
minants. These successes include Huntington's 
disease, Alzheimer's disease, and some forms of 
breast cancer. However, the detection of ge- 
netic factors for complex diseases — such as 
schizophrenia, bipolar disorder, and diabetes — 
has been far more complicated. There have 
been numerous reports of genes or loci that 
might underlie these disorders, but few of these 
findings have been replicated. The modest na- 
ture of the gene effects for these disorders likely 
explains the contradictory and inconclusive 
claims about their identification. Despite the 
small effects of such genes, the magnitude of 
their attributable risk (the proportion of people 
affected due to them) may be large because they 
are quite frequent in the population, making 
them of public health significance. 

Has the genetic study of complex disorders 
reached its limits? The persistent lack of 
replicability of these reports of linkage be- 
tween various loci and complex diseases 
might imply that it has. We argue below that 
the method that has been used successfully 
(linkage analysis) to find major genes has lim- 
ited power to detect genes of modest effect, 
but that a different approach (association 
studies) that utilizes candidate genes has far 
greater power, even if one needs to test every 
gene in the genome. Thus, the future of the 
genetics of complex diseases is likely to require 
large-scale testing by association analysis. 

How large does a gene effect need to be in 
order to be detectable by linkage analysis? 
We consider the following model: Suppose a 
disease susceptibility locus has two alleles A 
and a, with population frequencies p and q = 
1 - p, respectively. There are three geno- 
types: AA, Aa, and aa. We define genotypic 
relative risks (GRR, the increased chance 
that an individual with a particular genotype 
has the disease) as follows: Let the risk for 
individuals of genotype Aa be y times greater 
than the risk for individuals with genotype 
aa, a GRR of y. We assume a multiplicative 
relation for two A alleles, so that the GRR 
for genotype AA is y^2. The method of linkage
analysis we have chosen for this argument
is a popular current paradigm in which
pairs of siblings, both with the disease, are
examined for sharing of alleles at multiple
sites in the genome defined by genetic markers.
The more often the affected siblings
share the same allele at a particular site, the
more likely the site is close to the disease
gene. Using the formulas in (1), we calculate
the expected proportion Y of alleles shared by
a pair of affected siblings for the best possible
case, that is, a closely linked marker locus
(recombination fraction θ = 0) that is fully
informative (heterozygosity = 1) (2), as

$$Y = \frac{1 + w}{2 + w}, \qquad w = \frac{pq(y - 1)^2}{(py + q)^2}$$

N. Risch is in the Department of Genetics, Stanford
University School of Medicine, Stanford, CA 94305-5120,
USA. E-mail: risch@lahmed.stanford.edu. K. Merikangas
is in the Departments of Epidemiology and Psychiatry,
Unit, Yale University School of Medicine, New Haven,
CT 06510, USA. E-mail: kath@zeus.psych.yale.edu

If there is no linkage of a marker at a
particular site to the disease, the siblings 
would be expected to share alleles 50% of the 
time; that is, Y would equal 0.5. Values of Y 
for various values of p and y are given in the 
third column of the table. For an allele of 
moderate frequency (p is 0.1 to 0.5) that confers
a GRR (y) of fourfold or greater, there is a
detectable deviation of Y from the null value of
0.5. On the other hand, for an allele conferring
a GRR of 2 or less, the expected marker-sharing
only marginally exceeds 50%, for any allele
frequency (p). Thus, it is clear that the use of



linkage analysis for loci conferring GRR of 
about 2 or less will never allow identification 
because the number of families required 
(more than ~2500) is not practically achievable.
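
As a rough numerical check of the argument above, the sharing proportion can be evaluated directly. This is a minimal sketch, assuming the form Y = (1 + w)/(2 + w) with w = pq(y - 1)^2/(py + q)^2 that appears in the sample-size note (6). The function name `sharing_proportion` and the grid of allele frequencies are illustrative choices, not part of the original article.

```python
def sharing_proportion(p: float, grr: float) -> float:
    """Expected allele sharing Y for affected sib pairs (assumed formula)."""
    q = 1.0 - p
    # w = pq(y - 1)^2 / (py + q)^2, with y the genotypic relative risk.
    w = p * q * (grr - 1.0) ** 2 / (p * grr + q) ** 2
    return (1.0 + w) / (2.0 + w)


# Largest sharing proportion over a grid of allele frequencies, per GRR.
grid = [i / 100 for i in range(1, 100)]
for grr in (4.0, 2.0, 1.5):
    best = max(sharing_proportion(p, grr) for p in grid)
    print(f"GRR={grr}: max Y over p = {best:.3f}")
```

Under this formula a GRR of 1 gives exactly Y = 0.5, and for a GRR of 1.5 the value stays at or below roughly 0.51 for every allele frequency on the grid, which agrees with the later statement in the text that allele sharing is at most 51% in that case.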

Although tests of linkage for genes of mod- 
est effect are of low power, as shown by the 
above example, direct tests of association with 
a disease locus itself can still be quite strong. 
To illustrate this point, we use the transmission/disequilibrium
test of Spielman et al. (3).
In this test, transmission of a particular allele 
at a locus from heterozygous parents to their 
affected offspring is examined. Under Mende- 
lian inheritance, all alleles should have a 50% 
chance of being transmitted to the next gen- 
eration. In contrast, if one of the alleles is 
associated with disease risk, it will be trans- 
mitted more often than 50% of the time. 

For this approach, we do not need families 
with multiple affected siblings, but can focus 
just on single affected individuals and their 
parents. For the same model given above, we 
can calculate the proportion of heterozygous 
parents as pq(y + 1)/(py + q) (4). Similarly,
the probability for a heterozygote parent to
transmit the high risk A allele is just y/(1 + y).
Association tests can also be performed for 
pairs of affected siblings. When the locus is 
associated with disease, the transmission excess 
over 50% is the same as for single offspring, but 
the probability of parental heterozygosity is in- 
creased at low values of p; for higher values of p, 
the probability of parental heterozygosity is de- 
creased. The formula for parental heterozygos- 
ity for an affected pair of siblings for the same 
genetic model as used in the first example is 
Comparison of linkage and association studies. Number of families needed for identification of a
disease gene. 
On the right side of the table, we present
the proportion of heterozygous parents (Het)
and the probability of transmission of the A
allele from a heterozygous parent to an affected
child [P(tr-A)] for the same values of
GRR as considered above for the example of 
linkage analysis. The deviation from the null 
hypothesis of 50% transmission from het- 
erozygous parents is substantially greater 
than the excess allele sharing that is found by 
linkage analysis in sibling pairs. This disparity
between the methods is particularly true
for lower values of y (that is, with lower relative
risk). For example, for y = 1.5, allele
sharing is at most 51%, while the A allele is 
transmitted 60% of the time from heterozy- 
gous parents. 
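
The two quantities used in this comparison are simple enough to evaluate directly. The sketch below assumes the singleton formulas quoted in the text, pq(y + 1)/(py + q) for parental heterozygosity and y/(1 + y) for transmission of allele A. The function names are hypothetical and chosen only for this illustration.

```python
def het_parent_prob(p: float, grr: float) -> float:
    """Proportion of heterozygous parents of a single affected child."""
    q = 1.0 - p
    return p * q * (grr + 1.0) / (p * grr + q)


def transmit_a_prob(grr: float) -> float:
    """Probability that a heterozygous parent transmits the high-risk A allele."""
    return grr / (1.0 + grr)


for grr in (4.0, 2.0, 1.5):
    print(f"GRR={grr}: Het(p=0.1)={het_parent_prob(0.1, grr):.3f}, "
          f"P(tr-A)={transmit_a_prob(grr):.2f}")
```

With a GRR of 1.5 the transmission probability evaluates to 0.60, the figure cited in the paragraph above, and with a GRR of 1 the heterozygosity reduces to the Hardy-Weinberg value 2pq.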

In this respect then, association studies 
seem to be of greater power than linkage 
studies. But of course, the limitation of as- 
sociation studies is that the actual gene or 
genes involved in the disease must be tenta- 
tively identified before the test can be per- 
formed. In fact, the actual polymorphism 
within the gene (or at least a polymorphism in 
strong disequilibrium) must be available.
However, we show that this requirement is 
only daunting because of limitations imposed 
by current technological capabilities, not be- 
cause sufficient families with the disease are 
not available or the statistical power is inad- 
equate (5). For example, imagine the time 
when all human genes (say 100,000 in total) 
have been found and that simple, diallelic 
polymorphisms in these genes have been 
identified. Assume that five such diallelic 
polymorphisms have been identified within 
each gene, so that a total of 10 x 10^5 = 10^6
alleles need to be tested. The statistical prob- 
lem is that the large number of tests that need 
to be made leads to an inflation of the type 1 
error probability. For a linkage test with pairs 
of affected siblings, we use a lod score (loga- 
rithm of the odds ratio for linkage) criterion 
of 3.0, which asymptotically corresponds to a 
type 1 error probability α of about 10^-4. In a
linkage genome screen with 500 markers, 
this significance level gives a probability 
greater than 95% of no false positives. The 
equivalent false positive rate for 1,000,000 
independent association tests can be obtained
with a significance level α = 5 x 10^-8.
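
The equivalence of the two significance levels can be verified with a short calculation. This sketch assumes independent tests, as the text does, and takes the association level to be 5 x 10^-8, the value that the sample-size note pairs with Z = 5.33. The function name `prob_no_false_positive` is illustrative only.

```python
def prob_no_false_positive(alpha: float, n_tests: int) -> float:
    """Probability that none of n independent tests is a false positive."""
    return (1.0 - alpha) ** n_tests


# Five diallelic polymorphisms (two alleles each) in each of 100,000 genes.
n_alleles = 100_000 * 5 * 2

linkage = prob_no_false_positive(1e-4, 500)        # genome screen, 500 markers
association = prob_no_false_positive(5e-8, n_alleles)

print(f"tests for association: {n_alleles}")
print(f"linkage screen:     {linkage:.4f}")
print(f"association screen: {association:.4f}")
```

Both probabilities come out slightly above 0.95, so the two screens carry essentially the same overall false positive rate.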

We illustrate the power of linkage versus 
association tests at different significance lev- 
els by determining the sample size N (num- 
ber of families) necessary to obtain 80% 
power (the probability of rejecting the null 
hypothesis when it is false) (6) (see table). 
With a linkage approach and a disease gene 
with a GRR of 4 or greater, the number of 
affected sibling pairs necessary to detect link- 
age is realistic (185 or 297), provided the 
allele frequency p is between 5 and 75%. For 
a gene with a GRR of 2 or less, however, the 
sample sizes are generally beyond reach (well 



over 2000), precluding their identification 
by this approach. In contrast, the required 
sample size for the association test, even al- 
lowing for the smaller significance level, is 
vastly less than for linkage, especially for af- 
fected sibling pair families when the value of 
p is small. Even for a GRR of 1.5, the sample
sizes are generally less than 1000, well within 
reason. 

Thus, the primary limitation of genome- 
wide association tests is not a statistical one 
but a technological one. A large number of 
genes (up to 100,000) and polymorphisms 
(preferentially ones that create alterations in 
derived proteins or their expression) must first 
be identified, and an extremely large number 
of such polymorphisms will need to be tested. 
Although testing such a large number of poly- 
morphisms on several hundred, or even a 
thousand families, might currently seem im- 
plausible in scope, more efficient methods of 
screening a large number of polymorphisms 
(for example, sample pooling) may be pos- 
sible. Furthermore, the number of tests we 
have used as the basis for our calculations 
(1,000,000) is likely to be far larger than nec- 
essary if one allows for linkage disequilibrium, 
which could substantially reduce the required 
number of markers and families needed for 
initial screening. 

Some of the important loci for complex 
diseases will undoubtedly be found by link- 
age analysis. However, the limitations to de- 
tecting many of the remaining genes by link- 
age studies can be overcome; numerous ge- 
netic effects too weak to identify by linkage 
can be detected by genomic association stud- 
ies. Fortunately, the samples currently col- 
lected for linkage studies (for example, af- 
fected pairs of siblings and their parents) can 
also be used for such association studies. 
Thus, investigators should preserve their 
samples for future large-scale testing. 

The human genome project can have 
more than one reward. In addition to se- 
quencing the entire human genome, it can 
lead to identification of polymorphisms for 
all the genes in the human genome and the 
diseases to which they contribute. It is a 
charge to the molecular technologists to de- 
velop the tools to meet this challenge and 
provide the information necessary to identify 
the genetic basis of complex human diseases. 
let $E(B_i) = \mu$ and $\mathrm{Var}(B_i) = \sigma^2$. For a sample of size
$M$, let $T = \sum_i B_i / \sqrt{M}$. Then under $H_0$, $T$ also has mean
0 and variance 1, while under $H_1$, it has mean $\sqrt{M}\,\mu$
and variance $\sigma^2$. We assume that $T$ is approximately
normally distributed both under $H_0$ and $H_1$.
Then the sample size $M$ required to obtain a power
of $1 - \beta$ for a significance level $\alpha$ is given by

$$M = (Z_\alpha - \sigma Z_{1-\beta})^2 / \mu^2 \qquad (1)$$

For each affected sib pair, we score the number of
alleles shared ibd from each of $2N$ parents. Define
$B_i = 1$ if an allele is shared from the $i$th parent and
$B_i = -1$ if unshared. Under the null hypothesis of
no linkage, $P(B_i = 1) = P(B_i = -1) = 0.5$, so $E(B_i) = 0$
and $\mathrm{Var}(B_i) = 1$. For the genetic model described
above with genotypic relative risks of $y^2$, $y$, and 1,
allele sharing by affected sibs is independent for
the two parents; thus, we can consider sharing of
alleles one parent at a time. Thus, for affected sib
pairs assuming $\theta = 0$ and no linkage disequilibrium,
the formula is

$$2N = (Z_\alpha - \sigma Z_{1-\beta})^2 / \mu^2$$

where

$$\mu = 2Y - 1$$
$$\sigma^2 = 4Y(1 - Y)$$
$$Y = \frac{1 + w}{2 + w}$$
$$w = \frac{pq(y - 1)^2}{(py + q)^2}$$

$Z_\alpha = 3.72$ (corresponding to $\alpha = 10^{-4}$), and $Z_{1-\beta} = -0.84$
(corresponding to $1 - \beta = 0.80$). For an
association test using the transmission/disequilibrium
test, with the disease locus or a nearby locus
in complete disequilibrium, the number ($N$) of
families with affected singletons required for 80%
power is also calculated from formula 1. For this
case, we score the number of transmissions of allele
A from heterozygous parents. Let $h$ be the probability
a parent is heterozygous under the alternative
hypothesis, namely, $h = pq(y + 1)/(py + q)$. Then define
$B_i = h^{-0.5}$ if the parent is heterozygous and allele
A is transmitted; $B_i = 0$ if the parent is homozygous;
and $B_i = -h^{-0.5}$ if the parent is heterozygous
and transmits allele a. Under the null hypothesis,
$E(B_i) = 0$ and $\mathrm{Var}(B_i) = 1$. Under the alternative
hypothesis, $\mu = E(B_i) = \sqrt{h}\,(y - 1)/(y + 1)$ and $\sigma^2 =
\mathrm{Var}(B_i) = 1 - h(y - 1)^2/(y + 1)^2$. In this case, there are
two parents per family and they act independently,
so the required number ($N$) of families is given by
half of formula 1 where $\mu$ and $\sigma^2$ are given above.
Here, $Z_\alpha = 5.33$ (corresponding to $\alpha = 5 \times 10^{-8}$). For
the same test but with affected sib pairs instead of
singletons, the number of families required is given
by half of formula 1 (transmissions from two parents
to two children) with the same formulas for $\mu$ and $\sigma^2$
as for singleton families but now using the heterozygote
frequency for parents of affected sib pairs. Using
the above formulas, we can calculate sample
sizes for the three study designs.
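
Formula 1 and the singleton association design can be written out as a short calculation. This is a sketch under the assumptions stated in the note: normal approximation, Z values of 5.33 and -0.84, and half of formula 1 for two independent parents per family. The function names `required_observations` and `tdt_singleton_families` are hypothetical, and the outputs are computed here rather than copied from the article's table.

```python
import math

Z_ALPHA_ASSOC = 5.33  # significance level 5e-8, as given in the note
Z_POWER = -0.84       # lower quantile corresponding to 80% power


def required_observations(mu: float, sigma2: float,
                          z_alpha: float, z_power: float = Z_POWER) -> float:
    """Formula 1: M = (Z_alpha - sigma * Z_{1-beta})^2 / mu^2."""
    return (z_alpha - math.sqrt(sigma2) * z_power) ** 2 / mu ** 2


def tdt_singleton_families(p: float, grr: float) -> float:
    """Families with one affected child needed for the association test."""
    q = 1.0 - p
    h = p * q * (grr + 1.0) / (p * grr + q)        # parental heterozygosity
    mu = math.sqrt(h) * (grr - 1.0) / (grr + 1.0)  # mean under the alternative
    sigma2 = 1.0 - h * (grr - 1.0) ** 2 / (grr + 1.0) ** 2
    # Two parents per family act independently, so halve formula 1.
    return required_observations(mu, sigma2, Z_ALPHA_ASSOC) / 2.0


for p, grr in ((0.1, 4.0), (0.5, 1.5)):
    print(f"p={p}, GRR={grr}: about {tdt_singleton_families(p, grr):.0f} families")
```

For p = 0.5 and a GRR of 1.5 the sketch gives a figure below 1000 families, in line with the statement in the main text that sample sizes for a GRR of 1.5 are generally under 1000.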
ABSTRACT We have developed high-density DNA microarrays
of yeast ORFs. These microarrays can monitor
hybridization to ORFs for applications such as quantitative
differential gene expression analysis and screening for se- 
quence polymorphisms. Automated scripts retrieved sequence 
information from public databases to locate predicted ORFs 
and select appropriate primers for amplification. The primers 
were used to amplify yeast ORFs in 96-well plates, and the 
resulting products were arrayed using an automated microarraying
device. Arrays containing up to 2,479 yeast ORFs
were printed on a single slide. The hybridization of fluorescently
labeled samples to the array were detected and quantitated
with a laser confocal scanning microscope. Applications
of the microarrays are shown for genetic and gene
expression analysis at the whole genome level. 

The genome sequencing projects have generated and will con- 
tinue to generate enormous amounts of sequence data. The 
genomes of Saccharomyces cerevisiae, Haemophilus influenzae (1),
Mycoplasma genitalium (2), and Methanococcus jannaschii (3)
have been completely sequenced. Other model organisms have 
had substantial portions of their genomes sequenced as well 
including the nematode Caenorhabditis elegans (4) and the small 
flowering plant Arabidopsis thaliana (5). Given this ever- 
increasing amount of sequence information, new strategies are 
necessary to efficiently pursue the next phase of the genome 
projects— the elucidation of gene expression patterns and gene 
product function on a whole genome scale. 

One important use of genome sequence data is to attempt 
to identify the functions of predicted ORFs within the genome. 
Many of the ORFs identified in the yeast genome sequence 
were not identified in decades of genetic studies and have no 
significant homology to previously identified sequences in the 
database. In addition, even in cases where ORFs have signif- 
icant homology to sequences in the database, or have known 
sequence motifs (e.g., protein kinase), this is not sufficient to 
determine the actual biological role of the gene product. 
Experimental analysis must be performed to thoroughly understand
the biological function of a given ORF's product.
Model organisms, such as S. cerevisiae, will be extremely
important in improving our understanding of other more
complex and less manipulable organisms.

To examine in detail the functional role of individual ORFs and 
relationships between genes at the expression level, this work 
describes the use of genome sequence information to study large 
numbers of genes efficiently and systematically. The procedure 
was as follows. (i) Software scripts scanned annotated sequence
information from public databases for predicted ORFs. (ii) The
start and stop position of each identified ORF was extracted 
automatically, along with the sequence data of the ORF and 200 

The publication costs of this article were defrayed in part by page charge 
payment. This article must therefore be hereby marked "advertisement" in 
accordance with 18 U.S.C §1734 solely to indicate this fact. 
© 1997 by The National Academy of Sciences 0027-8424/97/9413057-6$2.00/0
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bases flanking either side. (iii) These data were used to
automatically select PCR primers that would amplify the ORF. (iv) The
primer sequences were automatically input into the automated
multiplex oligonucleotide synthesizer (6). (v) The oligonucleotides
were synthesized in 96-well format, and (vi) used in 96-well
format to amplify the desired ORFs from a genomic DNA
template. (vii) The products were arrayed using a high-density
DNA arrayer (7-10). The gene arrays can be used for hybridization
with a variety of labeled products such as cDNA for gene
expression analysis or genomic mismatch scanning purified DNA
for genotyping (11).
METHODS 

Script Design. All scripts were written in UNIX Tool Command
Language. Annotated sequence information from GenBank was 
extracted into one file containing the complete nucleotide se- 
quence of a single chromosome. A second file contained the 
assigned ORF name followed by the start and stop positions of that 
ORF. The actual sequence contained within the specified range, 
along with 200 bases of sequence flanking both sides, was extracted 
and input into the primer selection program PRIMER 0.5 (White- 
head Institute, Boston). Primers were designed so as to allow 
amplification of entire ORFs. The selected primer sequences were 
read by the 96-well automated multiplex oligonucleotide synthe- 
sizer instrument for primer synthesis. The forward and reverse 
primers were synthesized in two separate 96-well plates in corre- 
sponding wells. All primers were synthesized on a 20-nmol scale. 
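The extraction step described above (an ORF plus 200 flanking bases on each side, taken from a chromosome sequence using a table of start and stop positions) can be sketched as follows. This is an illustrative Python sketch, not the authors' Tcl scripts. The whitespace-separated `NAME START STOP` line format, the 1-based inclusive coordinates, and the function names are all assumptions introduced here.

```python
def parse_orf_line(line: str) -> tuple[str, int, int]:
    """Parse one line of the (assumed) ORF table: 'NAME START STOP'."""
    name, start, stop = line.split()
    return name, int(start), int(stop)


def extract_with_flanks(chromosome: str, start: int, stop: int,
                        flank: int = 200) -> str:
    """Return the ORF sequence plus up to `flank` bases on each side.

    Coordinates are assumed to be 1-based and inclusive. If start > stop
    (as some annotations use for the reverse strand), the two are swapped
    and the forward-strand sequence is returned.
    """
    low, high = sorted((start, stop))
    if low < 1 or high > len(chromosome):
        raise ValueError("ORF coordinates fall outside the chromosome")
    # Clamp the flanks so that ORFs near a chromosome end do not wrap around.
    left = max(low - 1 - flank, 0)
    right = min(high + flank, len(chromosome))
    return chromosome[left:right]


if __name__ == "__main__":
    # Toy chromosome: a 9-base ORF surrounded by 300 bases on each side.
    toy = "A" * 300 + "ATGAAATAG" + "C" * 300
    name, start, stop = parse_orf_line("ORF_X 301 309")
    region = extract_with_flanks(toy, start, stop)
    print(name, len(region))
```

The returned region is what would be handed to a primer selection program; primer design itself is outside the scope of this sketch.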

ORF Amplification and Purification. Genomic DNA was iso- 
lated as described (12) and used as template for the amplification 
reactions. Each PCR was done in a total volume of 100 µl. A total
of 0.2 µM each of forward and reverse primers were aliquoted into
a 96-well PCR plate (Robbins Scientific, Sunnyvale, CA); a master
mix containing 0.24 mM each dNTP, 10 mM Tris (pH 8.5), 50 mM
MgCl2, 25 units Taq polymerase, and 10 ng of template was added
to the primers, and the entire mix was thermal cycled for 30 cycles
as follows: 15 min at 94°C, 15 min at 54°C, and 30 min at 72°C.
Products were ethanol precipitated in polystyrene v-bottom 96-well
plates (Costar). All samples were dried and stored at -20°C.

Arraying Procedure and Processing. Microarrays were 
made as described (8). 

A custom built arraying robot was used to print batches of 48 
slides. The robot utilizes four printing tips which simultaneously 
pick up ~1 µl of solution from 96-well microtiter plates. After
printing, the microarrays were rehydrated for 30 sec in a humid
chamber and then snap dried for 2 sec on a hot plate (100°C). The
DNA was then UV crosslinked to the surface by subjecting the
slides to 60 millijoules of energy. The rest of the poly-L-lysine
surface was blocked by a 15-min incubation in a solution of 70 mM
succinic anhydride dissolved in a solution consisting of 315 ml of
1-methyl-2-pyrrolidinone (Aldrich) and 35 ml of 1 M boric acid
(pH 8.0). Directly after the blocking reaction, the bound DNA 
was denatured by a 2-min incubation in distilled water at ~95°C. 



Abbreviation: YEP, yeast extract/ peptone. 

†To whom reprint requests should be sent at the present address:
Synteni, Inc., 6519 Dumbarton Circle, Fremont, CA 94555. 
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DISCUSSION 

The results of these experiments show that many genes are 
differentially expressed under the three environmental condi- 
tions described here. The expected and predicted changes in gene 
expression, such as HSP12 in the heat-shocked culture, TIP1 in 
the cold-shocked culture, and GAL2 in the steady-state galactose 
culture, were observed in every case. However, in addition to the 
expected changes in gene expression, significant differential 
expression was also observed for many other genes that would 
not, a priori, be expected to be differentially expressed. For 
example, expression of PHO11 decreased and expression of
YLR194, KIN2, and HXT6 increased in the heat shocked culture. 
Expression of MST1 and APE3 decreased and expression of 
PDR5 and GAR1 increased in the cold-shocked culture. In 
addition, ADE4 and SER2 were expressed at reduced levels 
whereas PHO84 and ACH1 were expressed at higher levels in
cells grown in galactose compared with cells grown in glucose. 
Differential expression of these and many other genes was specific 
to one of these three environmental conditions. 

Many other genes were found to be differentially expressed 
under more than one condition. When differentially expressed 
genes in cold- and heat-shocked cultures were compared, 30 
genes were found in common. Of these 30 genes, 28 showed 
inverse expression (i.e., increased expression under one condition 
and decreased expression under the other condition). Two genes, 
YCR058 and YKL102, showed elevated expression in response to 
both cold and heat shock. Fifteen genes were found to be 
differentially expressed in both the heat-shocked and steady-state 
galactose cultures: 9 genes showed increased expression and 5 
showed decreased expression under both conditions. Twenty 
genes were differentially expressed in both the cold-shocked and 
steady-state galactose cultures: 8 genes showed decreased expres- 
sion and 5 genes showed increased expression under both con- 
ditions. Six genes showed increased expression in the galactose 
culture and decreased expression in the cold shocked culture. 
One gene (ODP1) showed increased expression in both the 
cold-shocked and steady-state galactose cultures. 
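The pairwise comparisons in this paragraph amount to intersecting two sets of differentially expressed genes and then splitting the shared genes by whether they change in the same or in opposite directions. A minimal sketch of that bookkeeping follows. The data structure (a mapping from gene name to +1 or -1) and the placeholder genes `GENE_A` to `GENE_C` are assumptions made for illustration; only YCR058 and YKL102 are taken from the text.

```python
def compare_conditions(first: dict[str, int],
                       second: dict[str, int]) -> tuple[list[str], list[str]]:
    """Split genes shared by two conditions into same-direction and inverse.

    Each mapping takes a gene name to +1 (increased expression) or
    -1 (decreased expression) relative to the control culture.
    """
    shared = sorted(first.keys() & second.keys())
    same = [gene for gene in shared if first[gene] == second[gene]]
    inverse = [gene for gene in shared if first[gene] != second[gene]]
    return same, inverse


if __name__ == "__main__":
    # YCR058 and YKL102 are reported above as elevated under both shocks;
    # GENE_A, GENE_B and GENE_C are made-up placeholders, not real results.
    heat = {"YCR058": +1, "YKL102": +1, "GENE_A": +1, "GENE_B": -1}
    cold = {"YCR058": +1, "YKL102": +1, "GENE_A": -1, "GENE_C": +1}
    same, inverse = compare_conditions(heat, cold)
    print("same direction:", same)
    print("inverse:", inverse)
```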

Gene expression is affected in a global fashion when environ- 
mental conditions are changed and both expected and unex- 
pected genes are affected. There is also overlap in the genes that 
are differentially expressed under quite different environmental 
conditions. These results can be rationalized by considering the 
high degree of cross-pathway regulation in yeast. For example, 
there is evidence for cross-pathway regulation between (i) carbon
and nitrogen metabolism (18), (ii) phosphate and sulfate metabolism
(19), and (iii) purine, phosphate, and amino acid metabolism
(20-24). There are also examples of the interaction of
general and specific transcription factors (25, 26). Finally, within 
the broad class of amino acid biosynthetic genes, there is evidence 
for amino acid specific regulation of some genes, regulation via 
general control for other genes, and regulation via both specific 
and general control for other genes (22, 27-30). 

Cross-pathway regulation arises from the complex structure 
of promoters. Virtually all promoters contain sites for multiple 
transcription factors and, therefore, virtually all genes are 
subject to combinatorial regulation. For example, the HIS4 
promoter contains binding sites for GCN4 (the general amino 
acid control transcription factor), PHO2/BAS2 (a transcriptional
regulator of phosphatase and purine biosynthetic
genes), and BAS1 (a transcriptional regulator of purine bio- 
synthetic genes) (31). It is likely that the complex effects on 
gene expression described in this work are a direct conse- 
quence of the combinatorial regulation of gene expression. 

These findings illustrate the power of the highly parallel whole 
genome approach when examining gene expression. The global 
effects of environmental change on gene expression can now be 
directly visualized. It is clear that determining the mechanism(s) 
and the functional role of the dramatic global effects on gene 
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expression in different environments will be a significant chal- 
lenge. The era of whole genome analysis will, ultimately, allow 
researchers to switch from the very focused single gene/promoter 
view of gene expression and instead view the cell more as a large 
complex network of gene regulatory pathways. 

With the entire sequence of this model organism known, new 
approaches have been developed that allow for genome wide 
analyses (32, 33) of gene function. The genome microarrays 
represent a novel tool for genetic and expression analysis of the 
yeast genome. This pilot study uses arrays containing >35% of 
the yeast ORFs and it is clear that the entire set of ORFs from 
the yeast genome can be arrayed using the directed primer based 
strategy detailed here. Recent advances in arraying technology
will allow all 6,100 ORFs to be arrayed in an area [illegible].
Furthermore, [illegible] limits
will allow less than 500 ng of starting mRNA material to be used
for making probe.

The genome arrays provide for a robust, fully automated 
approach toward examining genome structure and gene func- 
tion. They allow for comparisons between different genomes 
as well as a detailed study of gene expression at the global level. 
This research will help to elucidate relationships between 
genes and allow the researcher to understand gene function by
understanding expression patterns across the yeast genome. 

Support was provided by National Institutes of Health Grant 
PO/HG00205. 
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D-0.1 M KCl. Tat-SF/pp140 was eluted with in-
creasing salt concentrations and was detected
mostly in 0.2 to 0.4 M KCl fractions. These fractions
were pooled, dialyzed against buffer D-0.1 M KCl,
and loaded onto a glutathione Sepharose (Pharma-
cia) column containing GST-Tat fusion proteins. After
the column was washed with buffer D-0.4 M KCl,
Tat-SF/pp140 was eluted from the column with buff-
er D containing 1.4 M KCl. The estimated overall
purification after these steps was ~3000-fold. In the
experiment shown in Fig. 3, the 0.2 to 0.4 M KCl
heparin Sepharose fraction containing Tat-SF activ-
ity was subjected to fractionation through an Affi-Gel
10 matrix column (Bio-Rad) containing immobilized
Tat. Tat-SF activity was eluted from the column with
increasing salt concentrations. The 0.6 M KCl frac-
tion was analyzed as described in Fig. 3.
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Accessing Genetic Information with 
High-Density DNA Arrays 
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Rapid access to genetic information is central to the revolution taking place in molecular
genetics. The simultaneous analysis of the entire human mitochondrial genome is described
here. DNA arrays containing up to 135,000 probes complementary to the 16.6-kilobase
human mitochondrial genome were generated by light-directed chemical synthesis.
A two-color labeling scheme was developed that allows simultaneous comparison
of a polymorphic target to a reference DNA or RNA. Complete hybridization patterns
were revealed in a matter of minutes. Sequence polymorphisms were detected with
single-base resolution and unprecedented efficiency. The methods described are
generic and can be used to address a variety of questions in molecular genetics including
gene expression, genetic linkage, and genetic variability.



The fundamentals of light-directed oligonucleotide
array synthesis have been described (5, 6).
Any probe can be synthesized at any discrete,
specified location in the array, and any set of
probes composed of the four nucleotides can be
synthesized in a maximum of 4N cycles, where N
is the length of the longest probe in the array.
For example, the entire set of ~10^12 20-nucleotide
oligomer probes, or any desired subset, can be
synthesized in only 80 coupling cycles. The number
of different probes that can be synthesized is
limited only by the physical size of the array
and the achievable resolution.
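The 4N bound follows from building all probes one layer at a time: for each position along the probes there is at most one coupling cycle per nucleotide, and each cycle addresses only the sites that need that nucleotide at that position. The sketch below illustrates this counting argument under simplifying assumptions (probes are plain strings grown left to right, and a "cycle" is just a base plus the list of sites it is coupled to). The function names are introduced here and do not come from the source.

```python
BASES = "ACGT"


def synthesis_schedule(probes: list[str]) -> list[tuple[str, list[int]]]:
    """Return a list of coupling cycles as (base, sites receiving that base).

    For each position, up to four cycles are issued, one per base, so the
    schedule never exceeds 4 * N cycles for a longest probe of length N.
    """
    if not probes:
        return []
    longest = max(len(probe) for probe in probes)
    schedule = []
    for position in range(longest):
        for base in BASES:
            sites = [index for index, probe in enumerate(probes)
                     if position < len(probe) and probe[position] == base]
            if sites:  # skip cycles that no site needs
                schedule.append((base, sites))
    return schedule


def run_schedule(schedule: list[tuple[str, list[int]]],
                 n_sites: int) -> list[str]:
    """Replay the cycles and return the probe grown at each site."""
    grown = [""] * n_sites
    for base, sites in schedule:
        for site in sites:
            grown[site] += base
    return grown


if __name__ == "__main__":
    probes = ["ACG", "TTT", "AC", "GATTACA"]
    schedule = synthesis_schedule(probes)
    print(len(schedule), "cycles; bound is", 4 * max(map(len, probes)))
    print(run_schedule(schedule, len(probes)) == probes)
    # Figures quoted in the text: 20-mers need at most 4 * 20 cycles,
    # and there are 4 ** 20 distinct 20-mers.
    print(4 * 20, 4 ** 20)
```

The last line reproduces the numbers in the paragraph: 80 cycles and 4^20, which is about 1.1 × 10^12 possible 20-nucleotide probes.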

An array consisting of oligonucleotides
complementary to subsequences of a target
sequence can be used to determine the identity
of a target sequence, measure its amount,
and detect differences between the target
and a reference sequence. Many different
arrays can be designed for these purposes.
One such design, termed a 4L tiled array, is
depicted in Fig. 1A. In each set of four
probes, the perfect complement will hybridize
more strongly than mismatched probes.
By this approach, a nucleic acid target of
length L can be scanned for mutations with
a tiled array containing 4L probes. For example,
to query the 16,569 base pairs (bp) of
human mitochondrial DNA (mtDNA), only
66,276 probes of the possible ~10^9 15-nucleotide
oligomers need to be used.
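A 4L tiling can be sketched in a few lines: for every target position, four probes are generated that differ only at one interrogation position, and the probe carrying the true base is the perfect complement. The sketch below is illustrative only. It assumes 15-nucleotide probes with the interrogation position at offset 7, treats the target as circular (as mtDNA is) so that every position gets a full-length window, and keys each probe by the target base it would match; these choices and the function names are assumptions, not details taken from the source.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(sequence: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(sequence))


def tiled_array(target: str, probe_len: int = 15,
                sub_pos: int = 7) -> list[dict[str, str]]:
    """Build a 4L tiling: one set of four probes per target position.

    tiles[i][b] is the probe that perfectly complements the target window
    around position i if the base at position i were b.
    """
    length = len(target)
    if length < probe_len:
        raise ValueError("target is shorter than the probe length")
    tiles = []
    for i in range(length):
        # Window of the (circular) target with position i at offset sub_pos.
        window = [target[(i - sub_pos + k) % length] for k in range(probe_len)]
        probes = {}
        for base in "ACGT":
            variant = list(window)
            variant[sub_pos] = base
            probes[base] = reverse_complement("".join(variant))
        tiles.append(probes)
    return tiles


if __name__ == "__main__":
    toy_target = "ACGTTGCAAGCTTAGGCTAACGT"  # made-up sequence for illustration
    tiles = tiled_array(toy_target)
    print(sum(len(t) for t in tiles), "probes for L =", len(toy_target))
    print("probes needed for 16,569 bp:", 4 * 16569)
```

For the mtDNA example in the text, 4 × 16,569 gives the quoted 66,276 probes.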

The use of a tiled array of probes to read a
target sequence is illustrated in Fig. 1C. A
tiled array of 15-nucleotide oligomers varied



A central theme in modern genetics is the
relation between genetic variability and phenotype.
To understand genetic variation and
its consequences on biological function, an
enormous effort in comparative sequence
analysis will need to be carried out. Conventional
nucleic acid sequencing technologies
make use of analytical separation techniques
to resolve sequence [illegible] level (1, 2).
[illegible] increases linearly with the amount of
sequence. In contrast, biological systems read,
store, and modify genetic information by molecular
recognition (3). Because each DNA
strand carries with it the capacity to recognize
a uniquely complementary sequence through
base pairing, the process of recognition, or
hybridization, is highly parallel, as every
nucleotide in a large sequence can in principle
be queried at the same time. Thus, hybridization
can be used to efficiently analyze
large amounts of nucleotide sequence. In one
proposal, sequences are analyzed by hybridization
to a set of oligonucleotides representing
all possible subsequences (4). A second
approach [illegible] an
array of oligonucleotide probes designed to





Implementation of these concepts relies on
recently developed combinatorial technologies
to generate any ordered array of a large
number of oligonucleotide probes (5).

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Fig. 3. Human mitochondrial genome on a chip. (A) An image of the
array hybridized to 16.6 kb of mitochondrial target RNA (L strand). The
16,569-bp map of the genome is shown, and the H strand origin of
replication (OH), located in the control region, is indicated. (B) A portion
of the hybridization pattern magnified. In each column there are five
probes: A, C, G, T, and Δ, from top to bottom. The Δ probe has a
single-base deletion instead of a substitution and hence is 24 instead of
25 bases in length. The scale is indicated by the bar beneath the image.
Although there is considerable sequence-dependent intensity variation,
most of the array can be read directly. The image was collected at a
resolution of ~100 pixels per probe cell. (C) The ability of the array to
detect and read single-base differences in a 16.6-kb sample is illustrated.
Two different target sequences were hybridized in parallel to different
chips. The hybridization patterns are compared for four different positions
in the sequence. Only the P25,13 probes are shown. The top panel of each
pair shows the hybridization of the mt3 target, which matches the chip P0
sequence at these positions. The lower panel shows the pattern generated
by a sample from a patient with Leber's hereditary optic neuropathy
(LHON). Three known pathogenic mutations, LHON3460, LHON4216, and
LHON13708, are clearly detected. For comparison, the fourth panel in the
set shows a region around position 11,778 that is identical in both samples.




provide the foundation for a powerful genetic
analysis technology. The method
can be used to characterize the spectrum
of sequence variation in a population and
can be applied to the analysis of many
genes in parallel. In the case of human
mtDNA, we simultaneously [illegible]
control region, 13 protein-coding genes,
tRNA genes, and 2 ribosomal RNA
genes. The methods described here can be
applied to other research areas in molecular
genetics; for example, the ability to
identify and sequence polymorphisms provides
a basis for genetic mapping. The
specificity of oligonucleotide hybridization
and the scalability of the method
suggest the possibility of a dedicated array
that could be used to generate a high-resolution
genetic map of an entire genome
in a single experiment. Likewise,
the concepts and techniques described
here have been used to develop approaches
for mRNA identification and the large-scale,
parallel measurement of expression
levels (24). Thus, the sequence of a gene,
its spectrum of change in the population,
chromosomal location, and its dynamics
of expression (all essential to a full
understanding of function) can be determined
with high-density probe arrays. The
challenge now is to synthesize and read
probe arrays at even higher density. For
example, a 2 cm by 2 cm array, synthesized
with probes occupying 1-µm synthesis
sites in a 4L tiling, could query the entire
coding content of the human genome,
estimated at 100,000 genes.
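The closing estimate can be checked with simple arithmetic. The sketch below assumes square synthesis sites packed edge to edge with no gaps and exactly four probes per queried base (the 4L tiling); the function names are introduced here for illustration.

```python
def sites_on_array(side_cm: float, site_um: float) -> int:
    """Number of square synthesis sites on a square array.

    Assumes sites tile the array edge to edge (1 cm = 10^4 µm).
    """
    sites_per_side = round(side_cm * 1e4 / site_um)
    return sites_per_side ** 2


def bases_queried(n_sites: int, probes_per_base: int = 4) -> int:
    """Number of target bases a 4L tiling can interrogate with n_sites probes."""
    return n_sites // probes_per_base


if __name__ == "__main__":
    sites = sites_on_array(side_cm=2, site_um=1)
    bases = bases_queried(sites)
    print(f"{sites:,} sites, {bases:,} bases")
    print(f"{bases // 100_000:,} bases per gene for 100,000 genes")
```

Under these assumptions a 2 cm by 2 cm array holds 4 × 10^8 sites, enough to tile 10^8 bases, which corresponds to an average of 1,000 coding bases for each of the estimated 100,000 genes.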
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-4 x 10° functional copies of a specific probe, ' 
which corresponds to a mean distance of about 100
Å between probes (M. O. Trulson, D. Stern, R. P.
Rava, unpublished results).
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9. The control region of mtDNA is characterized by high 
amounts of sequence polymorphism concentrated
in two hypervariable regions [B. D. Greenberg, J. E.
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Aquardo and B. D. Greenberg. Genetics 103, 287 
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10. R. L. Cann. W. M. Brown, A. C. Wilson, Genetics 
106, 479 (1984).

11. The mt1 and mt2 sequences were cloned from amplified
genomic DNA extracted from hair roots [P.
Gill, A. J. Jeffreys, D. J. Werrett, Nature 318, 577
(1985); R. K. Saiki et al., Science 239, 487 (1988)].
The clones were sequenced conventionally (7). Cloning
was performed only to provide a set of pure
reference samples of known sequence. For templates
for fluorescent labeling, DNA was reamplified
from the clones with primers bearing bacteriophage
T3 and T7 RNA polymerase promoter sequences
(bold; mtDNA sequences uppercase): L15935-T3,
5'-ctcggaattaaccctcactaaaggAAACCTTTTTCCAAGGA
and H667-T7,
5'-taatacgactcactatagggagAGGCTAGGACCAAACCTATT.

12. Labeled RNAs from the two complementary mtDNA
strands [designated L and H (8)] were transcribed in
separate reactions from a promoter-tagged polymerase
chain reaction (PCR) product. Each 10-µl
reaction contained 1.5 mM each of the triphosphate
nucleotides ATP, CTP, GTP, and UTP; 0.24 mM
fluorescein-12-CTP (Du Pont); 0.24 mM fluorescein-12-UTP
(Boehringer Mannheim); ~1 to 5 nM (1.5 µl)
crude unpurified 1.3-kb PCR product; and T3 or T7
RNA polymerase (1 U/µl) (Promega) in a reaction
buffer supplied with the enzyme. The reaction was
carried out at 37°C for 1 to 2 hours. RNA was fragmented
to an average size of <100 nucleotides by
adjusting the solution to 30 mM MgCl2, by the addition
of 1 M MgCl2, and heating at 94°C for 40 min.
Fragmentation improved the uniformity and specificity
of hybridization (M. Chee et al., data not shown).
The extent of fragmentation is dependent on the
magnesium ion concentration [J. W. Huff, K. S. Sastry,
M. P. Gordon, W. E. C. Wacker, Biochemistry 3,
501 (1964); J. J. Butzow and G. L. Eichhorn, Biopolymers
3, 95 (1965)]. Good hybridization results have
been obtained with both DNA and RNA targets prepared
with a variety of labeling schemes, including
incorporation of fluorescent and biotinylated
deoxynucleoside triphosphates by DNA polymerases,
incorporation of dye-labeled primers during PCR,
ligation of labeled oligonucleotides to fragmented
RNA, and direct labeling by photo-cross-linking a
psoralen derivative of biotin directly to fragmented
nucleic acids (L. Wodicka, personal communication).

13. For two-color detection experiments, the reference
and unknown samples were labeled with biotin and
fluorescein, respectively, in separate transcription
reactions. Reactions were carried out as described
(12) except that each contained 1.25 mM of ATP,
CTP, GTP, and UTP and 0.5 mM fluorescein-12-UTP
or 0.25 mM biotin-16-UTP (Boehringer Mannheim).
The two reactions were mixed in the ratio 1:5
(v/v) biotin:fluorescein and fragmented (12). Targets
were diluted to a final concentration of ~100 to 1000
pM in 3 M TMACl [W. B. Melchior Jr. and P. H. von
Hippel, Proc. Natl. Acad. Sci. U.S.A. 70, 298 (1973)],
10 mM tris-HCl, pH 8.0, 1 mM EDTA, 0.005% Triton
X-100, and 0.2 nM control oligonucleotide labeled at
the 5' end with fluorescein (5'-CTGAACGGTAGCATCTTGAC).
Samples were denatured at 95°C for
5 min, chilled on ice for 5 min, and equilibrated to
37°C. A volume of 180 µl of hybridization solution was
then added to the flow cell [R. Lipshutz et al.,
Biotechniques 19, 442 (1995)] and the chip incubated at 37°C
for 3 hours with rotation at 60 rpm. The chip was
washed six times at room temperature with 6x SSPE
(0.9 M NaCl, 60 mM NaH2PO4, 6 mM EDTA, pH 7.4),
0.005% Triton X-100. Phycoerythrin-conjugated
streptavidin ([illegible] µg/ml in 6x SSPE, 0.005% Triton
X-100) was added and incubation continued at room
temperature for 5 min. The chip was washed again
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Light-Directed, Spatially Addressable Parallel 

Chemical Synthesis 



Stephen P. A. Fodor,* J. Leighton Read, Michael C. Pirrung,†
Lubert Stryer,‡ Amy Tsai Lu, Dennis Solas



Solid-phase chemistry, photolabile protecting groups,
and photolithography have been combined to achieve
light-directed, spatially addressable parallel chemical
synthesis to yield a highly diverse set of chemical products.
Binary masking, one of many possible combinatorial
synthesis strategies, yields 2^n compounds in n chemical
steps. An array of 1024 peptides was synthesized in ten
steps, and its interaction with a monoclonal antibody was
assayed by epifluorescence microscopy. High-density arrays
formed by light-directed synthesis are potentially
rich sources of chemical diversity for discovering new
ligands that bind to biological receptors and for elucidating
principles governing molecular interactions. The
generality of this approach is illustrated by the light-directed
synthesis of a dinucleotide. Spatially directed synthesis of
complex compounds could also be used for microfabrication
of devices.



THE REVOLUTION IN MICROELECTRONICS HAS BEEN MADE 
possible by photolithography, a process in which light is
used to spatially direct the simultaneous formation of many 
electrical circuits. We report a method that uses light to direct the 
simultaneous synthesis of many different chemical compounds. 
Synthesis occurs on a solid support. The pattern of exposure to light 
or other forms of energy through a mask, or by other spatially 
addressable means, determines which regions of the support are
activated for chemical coupling. Activation by light results from the 
removal of photolabile protecting groups from selected areas (Fig. 
1). After deprotection, the first of a set of "building blocks" (for
example, amino acids or nucleic acids, each bearing a photolabile 
protecting group) is exposed to the entire surface, but reaction 
occurs only with regions that were addressed by light in the 
preceding step. The substrate is then illuminated through a second 
mask, which activates a different region for reaction with a second 
protected building block. The pattern of masks used in these 
illuminations and the sequence of reactants define the ultimate 
products and their locations. The number of compounds that can be 
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synthesized by this technique is limited only by the number of
synthesis sites that can be addressed with appropriate resolution. 
Combinatorial masking strategies can be used to form a large 
number of compounds in a small number of chemical steps. 
Moreover, a high degree of miniaturization is possible because the 
density of synthesis sites is bounded only by physical limitations on 
spatial addressability, in this case the diffraction of light. Each
compound is accessible and its position is precisely known. Hence, 
its interactions with other molecules can be assessed. 
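One way to see how a combinatorial masking strategy yields many compounds in few steps is the binary masking scheme named in the abstract, in which each step exposes half of the synthesis sites. The sketch below is a simplified model and not the authors' mask design: it assumes 2^n sites indexed 0 to 2^n - 1, lets step k illuminate the sites whose index has bit k set, and appends each new building block to the right of the growing string, ignoring the chemistry's actual growth direction. The single-letter labels are arbitrary placeholders rather than amino acid codes.

```python
def binary_masks(n_steps: int) -> list[list[bool]]:
    """Return one mask per step; mask[site] is True if the site is illuminated.

    Step k illuminates exactly those sites whose index has bit k set,
    which is half of the 2 ** n_steps sites.
    """
    n_sites = 2 ** n_steps
    return [[bool((site >> step) & 1) for site in range(n_sites)]
            for step in range(n_steps)]


def synthesize(building_blocks: str) -> list[str]:
    """Simulate the synthesis: each step couples one block to illuminated sites."""
    masks = binary_masks(len(building_blocks))
    products = [""] * (2 ** len(building_blocks))
    for block, mask in zip(building_blocks, masks):
        for site, illuminated in enumerate(mask):
            if illuminated:
                products[site] += block
    return products


if __name__ == "__main__":
    products = synthesize("ABCDEFGHIJ")  # ten placeholder building blocks
    print(len(products), "sites,", len(set(products)), "distinct products")
```

With ten distinct placeholder blocks every site ends up with a different subset, giving 2^10 = 1024 distinct products, one of which is the empty (unmodified) site.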
Spatially localized photodeprotection. Spatially localized sub- 



strate activation can be accomplished by photolithographic tech- 
niques. Amino groups at the ends of linkers attached to a glass 
substrate were derivatized with nitroveratryloxycarbonyl (NVOC), 
a photoremovable protecting group (1). Photodeprotection was
effected by illumination of the substrate through a mask (a 100 µm
by 100 µm checkerboard) with alternating opaque and transparent
elements. The free amino groups were fluorescently labeled by
treatment of the entire substrate surface with fluorescein
isothiocyanate (FITC). The substrate was then scanned in an
epifluorescence microscope. The presence of a high-contrast
fluorescent checkerboard pattern with 100 µm by 100 µm elements
(depicted in red in
Fig. 2) reveals that free amino groups were generated in specific
regions by spatially localized photodeprotection. 

Light-directed peptide synthesis. Light-directed synthesis of two 
pentapeptides was carried out as outlined in Fig. 3. The
1-hydroxybenzotriazole (HOBt)-activated ester of NVOC-Leu
(NVOC-Leu-OBt) was allowed to react with the entire surface of a
substrate that had previously been derivatized with amino functional
groups. After removal of the NVOC protecting group by uniform
illumination, the substrate was treated with NVOC-Phe-OBt. Two
repetitions of this cycle with NVOC-Gly-OBt generated a substrate
containing NVOC-GGFL across the entire surface (2). Spatially
localized photodeprotection was then performed through a 50-µm
checkerboard mask. The surface was then treated with
Nα-tert-butyloxycarbonyl-O-tert-butyl-L-tyrosine. Finally, the surface
was uniformly illuminated to photolyze the remaining NVOC-GGFL sites
and treated with NVOC-Pro-OBt. After removal of the protecting groups,
the surface consists of an array of H2N-Tyr-Gly-Gly-Phe-Leu (YGGFL)
and H2N-Pro-Gly-Gly-Phe-Leu (PGGFL) peptides in 50 µm by 50 µm elements.
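The sequence of illumination and coupling steps described above can be traced with a small simulation. This is a toy model under stated assumptions: each site is either free, NVOC-capped (removable by light), or Boc-capped (treated here as unaffected by light), every coupling goes to completion, and the grid size and class names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Site:
    chain: str = ""     # peptide written N-terminus first
    cap: str = "free"   # "free", "NVOC" (photolabile) or "Boc" (not photolabile)


class Substrate:
    def __init__(self, rows: int, cols: int):
        # The substrate starts with free amino groups everywhere.
        self.sites = [[Site() for _ in range(cols)] for _ in range(rows)]

    def illuminate(self, mask=None):
        """Remove NVOC where mask(row, col) is true (everywhere if mask is None)."""
        for r, row in enumerate(self.sites):
            for c, site in enumerate(row):
                if site.cap == "NVOC" and (mask is None or mask(r, c)):
                    site.cap = "free"

    def couple(self, residue: str, cap: str = "NVOC"):
        """Add a protected residue to the N-terminus of every free site."""
        for row in self.sites:
            for site in row:
                if site.cap == "free":
                    site.chain = residue + site.chain
                    site.cap = cap

    def layout(self) -> list[list[str]]:
        return [[site.chain for site in row] for row in self.sites]


def checkerboard(row: int, col: int) -> bool:
    return (row + col) % 2 == 0


def make_pentapeptide_array(rows: int = 4, cols: int = 4) -> list[list[str]]:
    substrate = Substrate(rows, cols)
    substrate.couple("L")                 # NVOC-Leu on the whole surface
    for residue in "FGG":                 # Phe, then Gly twice
        substrate.illuminate()            # uniform illumination
        substrate.couple(residue)
    substrate.illuminate(checkerboard)    # localized photodeprotection
    substrate.couple("Y", cap="Boc")      # Tyr only on illuminated squares
    substrate.illuminate()                # photolyze remaining NVOC-GGFL sites
    substrate.couple("P")                 # Pro on the other squares
    return substrate.layout()


if __name__ == "__main__":
    for row in make_pentapeptide_array():
        print(" ".join(row))
```

The printed grid alternates YGGFL and PGGFL, mirroring the checkerboard of the two pentapeptides described in the text.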

Antibody recognition of the peptide pattern. The pentapeptide 
array was probed with a mouse monoclonal antibody directed 
against β-endorphin. This antibody (called 3E7) binds YGGFL and
YGGFM with nanomolar affinity (3) and requires the amino- 
terminal Tyr for high-affinity binding. A second incubation with 
fluorescein-labeled goat antibody to mouse was used to detect 
regions containing bound 3E7. As shown in Fig. 4, a high-contrast 
(>12:1 intensity ratio) fluorescence checkerboard image shows that
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group of deoxycytidine (see Fig. 8).

The 3-D representation of the fluorescence intensity data in Fig. 
8 reproduces the checkerboard illumination pattern used during 
photolysis of the substrate. This result demonstrates that oligonu- 
cleotides as well as peptides can be synthesized by the light-directed 
method. 

Comparison to other methods and potential applications. We 
have introduced an approach for the simultaneous synthesis of a 
large number of compounds that combines solid-phase synthesis 
(11), photolabile protecting groups (1), and photolithography (12).
The method can be applied to any solid-phase synthesis technique in 
which light can be used to generate a reactive group. We have used 
light-directed spatially addressable parallel chemical synthesis to 
synthesize large arrays of peptides. The light-directed formation of 
oligonucleotides attests to the versatility of the technique and 
suggests that it could be broadly applicable in making high-density 
arrays of chemical compounds. The ten-step binary synthesis results 
in the formation of 1024 peptides in 1.6 cm². The 50-µm checkerboard
pattern of alternating pentapeptides shows that 40,000
compounds can be synthesized in 1 cm². Our present capability for
high-contrast photodeprotection is better than 20 µm, which gives
>250,000 synthesis sites per square centimeter. There is no physical
reason why higher densities of synthesis sites cannot be achieved.
Indeed, high spatial resolution electron-beam lithography (~250 Å)
has been used to generate patterns at a density of 10^10 per square
centimeter (13).
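The site densities quoted in this paragraph follow directly from the feature size. The sketch below assumes square sites packed edge to edge with no spacing; the function names are introduced here for illustration.

```python
import math


def sites_per_cm2(feature_um: float) -> float:
    """Square sites of edge `feature_um` that fit in 1 cm^2 (1 cm = 10^4 µm)."""
    per_side = 1e4 / feature_um
    return per_side ** 2


def implied_feature_um(n_sites: int, area_cm2: float) -> float:
    """Edge length (µm) of each square site if n_sites fill area_cm2 exactly."""
    return math.sqrt(area_cm2 / n_sites) * 1e4


if __name__ == "__main__":
    for feature in (50, 20):
        print(f"{feature} µm sites: {sites_per_cm2(feature):,.0f} per cm^2")
    print(f"1024 sites in 1.6 cm^2: {implied_feature_um(1024, 1.6):.0f} µm edge")
```

With these assumptions, 50-µm sites give 40,000 per square centimeter and 20-µm sites give 250,000, matching the figures in the text, while 1024 peptides in 1.6 cm² corresponds to sites roughly 400 µm on a side.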

It is interesting to compare the light-directed method with other 
techniques for parallel chemical synthesis. One approach is to 
physically segregate different reactants by pipeting them into differ- 
ent reaction vessels. For example, 96 peptides have been simultane- 
ously synthesized on the tips of pins by immersing them into 
different solutions that are contained in the chambers of a microtiter 
plate (14). The need for physical separation of reaction sites sharply 
limits the .number of compounds that can be made by the pin 
method. In contrast, very large numbers of peptides can be gener- 
ated by recombinant DNA approaches (9, 15). Millions of different 
peptide sequences can be expressed on the surface of phage by 
inserting randomly synthesized oligonucleotides into their genomes. 
Each phage done displays a different peptide. Although the pep- 
tides-on-phage are in suspension and are not fixed at defined 
locations, those that bind tightly to a receptor can be identified by 
panning, isolation of individual clones, and DNA sequencing. Only 
peptides that contain genetically coded amino acids can be generated 
by expression on phage. The recombinant and light-directed ap- 
proaches have distinctive strengths that are complementary. For 
example, a peptide identified by the phage method to have appre- 
ciable affinity for a receptor can serve as the kernel around which 
diversity is generated by light-directed synthesis. A synthesis might 
include custom chemical building blocks in addition to the standard 
set of L-amino acids (16). For example, D-amino acids could be 
introduced to make the peptide more resistant to proteolysis (17), 
and modified side chains (18) could be inserted to increase affinity. 

Parallel chemical synthesis could be used to explore molecular 
recognition processes in biology and other fields. For example, 
pharmaceutical discovery is increasingly based on an understanding 
of the way receptors and enzymes interact with specific ligands. The 
techniques described here allow the synthesis of large numbers of 
peptides or other oligomers that can be surveyed for binding to 
biological macromolecules. 

Fabrication of small devices such as microelectronic circuits relies 
on the chemistry of photoresists, vapor deposition, and ion implan- 
tation. The techniques described here enable the in situ synthesis of 
complex compounds on a microscale. The methods of spatially 
addressable chemical synthesis may be used in conjunction with the 
microfabrication of circuitry. The union of these technologies 
could find applications in novel detection devices containing arrays of 
biological receptors or other molecular recognition elements. 

The functional properties of molecules synthesized by the light- 
directed approach can be read in a variety of ways. As was shown 
here, the binding of a receptor such as an antibody can readily be 
detected fluorimetrically. Radioactive or chemiluminescent labels 
could also be used (19). The susceptibility of compounds in an array 
to modification by an enzyme or other catalyst could also be directly 
assayed. For example, the cleavage of a peptide at a site located 
between a fluorescent energy donor and acceptor would lead to 
increased fluorescence (20). Peptides that are effective substrates for 
phosphorylation by a kinase could be identified by monitoring the 
32P pattern following incubation with enzyme and radiolabeled 
ATP (adenosine triphosphate). 

Oligonucleotide arrays produced by light-directed synthesis could 
be used to detect complementary sequences in DNA and RNA. 
Such arrays would be valuable in gene mapping, fingerprinting, 
diagnostics, and nucleic acid sequencing. A sequencing method 
based on hybridization to a complete set of fixed-length oligonucle- 
otides immobilized individually as dots of a two-dimensional matrix 
has been proposed (21). It is noteworthy that the light-directed 
synthesis of all 65,536 possible octanucleotides (4^8) would fit into 
1.6 cm^2 with 50-µm square sites, a resolution already achieved. 
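
The area estimate for a complete octanucleotide array can be checked directly: 4^8 sites, each 50 µm on a side. The following is a minimal sketch of that calculation; the function name `array_area_cm2` and the assumption of gap-free square sites are illustrative, not taken from the source.

```python
def array_area_cm2(oligo_length: int, site_side_um: float, alphabet_size: int = 4) -> float:
    """Area needed to hold every oligomer of a given length, one per square site."""
    n_sites = alphabet_size ** oligo_length   # 4^8 = 65,536 for octanucleotides
    site_side_cm = site_side_um * 1e-4        # 1 um = 1e-4 cm
    return n_sites * site_side_cm ** 2


print(4 ** 8)                           # 65536
print(round(array_area_cm2(8, 50), 2))  # 1.64
```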
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5. Not all light-activated syntheses can be represented as factored polynomials. Some 
can only be denoted by irreducible (prime) polynomials. 

6. Binary rounds and nonbinary rounds can be interspersed as desired, as in 

P = (A + 0)(B)(C + D + 0)(E + F + G) 

The 18 compounds formed are ABCE, ABCF, ABCG, ABDE, ABDF, ABDG, 
ABE, ABF, ABG, BCE, BCF, BCG, BDE, BDF, BDG, BE, BF, and BG. The 
switch matrix S for this seven-step synthesis is 

    111111111000000000 
    111111111111111111 
    111000000111000000 
S = 000111000000111000 
    100100100100100100 
    010010010010010010 
    001001001001001001 

The round denoted by (B) places B in all products because the reaction area was 
uniformly activated (the mask for B consisted entirely of 1's). The number of 
compounds $k$ formed in a synthesis consisting of $r$ rounds, in which the $i$th round 
has $b_i$ chemical reactants and $z_i$ nulls, is 

$$k = \prod_{i=1}^{r} (b_i + z_i)$$

and the number of chemical steps $n$ is 

$$n = \sum_{i=1}^{r} b_i.$$

The number of compounds synthesized when $b = a$ and $z = 0$ in all rounds is $a^{n/a}$, 
compared with $2^n$ for a binary synthesis. For $n = 20$ and $a = 5$, 625 compounds 
(all tetramers) would be formed, compared with $1.049 \times 10^6$ compounds in a 
binary synthesis with the same number of chemical steps. It should also be noted 
that rounds in a polynomial can be nested, as in 

[A + (B + 0)(C + 0)](D + 0) 

The products are AD, BCD, BD, CD, D, A, BC, B, C, and 0. 
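
The expansion of a factored synthesis polynomial and the corresponding switch matrix can be sketched in a few lines. This is an illustrative reading of note 6, not the authors' software: representing the null term (written 0 above) as an empty string is an assumption, and the helper names `expand` and `switch_matrix` are hypothetical.

```python
from itertools import product
from math import prod

NULL = ""  # assumption: the null term (written 0 in the text) is an empty string


def expand(rounds):
    """List the products of a factored synthesis polynomial, round by round."""
    return ["".join(choice) for choice in product(*rounds)]


def switch_matrix(rounds):
    """One row per chemical step, one column per compound (1 = site illuminated)."""
    choices = list(product(*rounds))
    rows = []
    for i, terms in enumerate(rounds):
        for term in terms:
            if term != NULL:
                rows.append([int(choice[i] == term) for choice in choices])
    return rows


# P = (A + 0)(B)(C + D + 0)(E + F + G)
rounds = [["A", NULL], ["B"], ["C", "D", NULL], ["E", "F", "G"]]

compounds = expand(rounds)
k = prod(len(terms) for terms in rounds)                     # product of (b_i + z_i)
n = sum(term != NULL for terms in rounds for term in terms)  # sum of b_i

print(k, n, compounds)
for row in switch_matrix(rounds):
    print("".join(map(str, row)))
```

For the example polynomial this yields k = 18 compounds in n = 7 chemical steps, and the printed rows reproduce the seven rows of S in the order A, B, C, D, E, F, G.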

7. The longest peptide formed is fYGAGTFLSF (the amino-terminal residue is f and the 
carboxyl-terminal residue linked to the substrate is F). Because the solid-phase 
synthesis is carried out in the carboxyl-to-amino direction, F is the first, S is the 
second, and f is the tenth unit to be coupled. 

8. In cases where the monovalent interaction of a receptor with a ligand is very low 
(where the off rate is rapid), it may not be possible to detect the binding of the 
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Human genome diversity 

Diversité génomique humaine 
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Abstract - Human genome diversity studies analyse genetic variation among indivi- 
duals and between populations in order to understand the origins and evolution of 
anatomically modern humans (Homo sapiens sapiens). The availability of thousands of 
DNA polymorphisms (genetic markers) brings analytic power to these studies. Human 
genome diversity studies have clearly shown that the large part of genetic variability is 
due to differences among individuals within populations rather than to differences 
between populations, effectively discrediting a genetic basis of the concept of 'race'. 
Evidence from paleontology, archaeology and genetic diversity studies is quite 
consistent with an African origin of modern humans more than 100 000 years ago. The 
evidence favors migrations out of Africa as the source of the original peopling of Asia, 
Australia, Europe and Oceania. An international program for the scientific analysis of 
human genome diversity and of human evolution has been developed. The Human 
Genome Diversity Project (HGDP) aims to collect and preserve biologic samples from 
hundreds of populations throughout the world, make DNA from these samples available 
to scientists and distribute to the scientific community the results of DNA typing with 
hundreds of genetic markers. (© Academie des sciences / Elsevier, Paris.) 

human genome diversity / genetic variability / Homo sapiens sapiens / modern human origins and 
evolution / polymorphic DNA markers / Human Genome Diversity Project. 

Résumé - Studies of human genome diversity analyse genetic variation among 
individuals and between populations in order to understand the evolution of modern humans (Homo 
sapiens sapiens). The identification of a large number of nuclear and mitochondrial DNA 
polymorphisms has made it possible to refine these analyses. These studies have shown that most 
genetic variability is due to differences among individuals of the same population 
rather than between the populations themselves, thereby refuting a genetic basis for the concept of 
'race'. Studies in paleontology, archaeology and genetic diversity consistently point to 
an African origin of modern humans more than 100 000 years ago. 
Some of them subsequently migrated to Asia, Australia, Europe and Oceania. An 
international program for the scientific analysis of genome diversity and of 
human evolution is currently being set up. This program, called Human Genome 
Diversity, aims to collect and preserve biological samples from hundreds of 
populations throughout the world, to provide researchers with DNA from these samples for 
characterization with hundreds of genetic markers, and to analyse these genetic data, which 
will then be made available to the scientific community. (© Académie des sciences / 
Elsevier, Paris.) 

human genome diversity / genetic variability / Homo sapiens sapiens / origin and evolution of modern 
humans / DNA genetic markers 
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shown that indeed language and allele frequencies tend 
to be correlated among populations. Intriguingly, there 
are now hints in several unilingual populations of differ- 
ing Y chromosome and mtDNA contributions to genetic 
variation, in that the variation of the latter marker is 
increased as compared to that of the former. The conclu- 
sion here may be that the linguistic barrier is more readily 
breached by women through incorporation into a 
predominantly unilingual population. 

Probably the most recent dramatic and novel result in 
human genetic diversity studies has been the sequencing 
of a portion of the control region of mtDNA extracted with 
great care from a Homo sapiens neanderthalensis fossil 
thought to be between 30 000 and 100 000 years old. 
This incredible experiment, following an admirably 
rigorous design, provided the entire 379-bp sequence of the 
hypervariable region 1, as deduced from many cloned, 
overlapping, short PCR products. This sequence was 
shown to differ markedly from the corresponding 
sequence of all known modern human mtDNAs. The 
mean number of pairwise base substitution differences 
between the Neanderthal and modern human mtDNA 
hypervariable region 1 (27 differences) is more than three 
times that observed among humans (on average eight dif- 
ferences). The difference seems to be sufficient to place 
the Neanderthal sequence outside of the variation that 
occurs among humans. This is an exciting result indicat- 
ing that Neanderthals did not contribute mtDNA (and pre- 
sumably nuclear DNA) to modern humans. 

The above examples of results issuing from studies of 
human genetic/genome diversity show the importance in 
human biology of this field of research. The discrediting of 
a genetic basis of the concept of race, understanding the 
origin of modern humans and the details of the peopling 
of the world and the sequencing of Neanderthal mtDNA 
are hardly trivial undertakings, and there are many other 
interesting and important questions to be posed and 
answered. For instance, haplotypes, which are more 
informative than individual loci for the description of 
chromosomes of population founders, will become the 
genetic units for analyses of human genome diversity, 
which, in turn, will provide information on their origins, 
ages and evolution. The development of molecular poly- 
morphic markers, and many of them, provides a depth of 
analytic resolution and power heretofore unavailable for, 
and is clearly impinging on, research design of diversity 
studies. The concept of genome diversity is clearly 
embodied in developing future studies involving markers 
drawn from throughout the genome, hundreds of markers 
that are highly polymorphic, as well as thousands of the 
more stable (less mutation and thus less polymorphic) sin- 
gle nucleotide polymorphisms (SNPs) that detect variation 
on average once every ~1 000 base pairs. Automatic 
typing of both groups of markers is a reality and allows the 
equivalent of diversity genome scans of thousands of indi- 
viduals. Put these together with methods for high through- 
put automatic DNA extraction from thousands of blood 

C. R. Acad. Sci. Paris, Sciences de la vie / Life Sciences 
1998. 321, 443-446 

samples collected from world-wide population samples 
of hundreds of individuals each and with analytic 
methods for calculating inter-population genetic distances and 
evolutionary trees and describing in detail geographic 
variation of populations, essentially based on functions of 
allele frequency distributions, and one has the ingredients 
for an organized international collaboration on human 
genome diversity. Indeed, the Human Genome Diversity 
Project, after a slow start, is gathering steam, impelled 
recently by a favorable evaluation of the field by a com- 
mittee of outstanding scientists and ethics specialists con- 
vened by the U.S. National Research Council. 

The Human Genome Diversity Project (HGDP) is a 
program for the scientific analysis of human genetic diver- 
sity and evolution. It aims to 1) collect and preserve 
biologic samples from populations throughout the world; 
2) make DNA from these samples available to scientists; 
and 3) distribute to the scientific community the DNA 
typing results. The HGDP will be organized as an inter- 
national collaboration of scientists who work on human 
variation (usually geneticists, physical anthropologists, 
paleontologists and archaeologists). Collaborators will 
provide blood samples from world populations and/or 
type the DNA from these samples. Collaborator activities 
will be coordinated by several major international repo- 
sitories, which will be responsible for receiving and 
processing blood samples, storing purified lymphocytes 
and the leukocyte fraction from peripheral blood, esta- 
blishing lymphoblastoid cell lines (LCLs) and extracting 
DNA from these resources for distribution to 
collaborators. A database, containing DNA typing results as well as 
ethnographic information, will be developed and main- 
tained online for collaborating scientists initially and 
eventually for the public. 

Ethical issues play a critical role in the research design 
and organization of this project. Protection of the auton- 
omy, privacy and welfare of those who participate in the 
project has been a central concern of those involved in 
this type of research. These obligations as they apply to 
individual subjects and, perhaps to populations, have 
been discussed and studied by the organizers of the pro- 
posed project, as well as by a subcommittee of the 
UNESCO International Bioethics Committee and the U.S. 
National Research Council. The project requires a chal- 
lenging application of the ethical principles used in other 
aspects of human genetics research. 

A preliminary project is planned that would bring 
together some 500-1 000 already-existing LCLs from 
populations in Africa, Europe, Asia, the Americas and 
Oceania. DNA from these LCLs will be distributed to 
collaborating scientists for testing with various micro- 
satellite and, perhaps, SNP markers in order to develop a 
common panel of hundreds (to thousands) of markers for 
use in the HGDP program. These cell lines are expected 
to be gathered this year. The research program will then 
follow. The goals of the extended research program are to 
obtain blood samples from 100 to 250 individuals from 
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phylogenetic analysis. 
11. The methods described by G. H. Learn et al. [J. Virol. 
70, 5720 (1996)] were used to align DNA sequences 
(with the use of CLUSTALW plus manual adjustment), 
calculate genetic distances (with the use of 
DNADIST, using the maximum likelihood method), 
evaluate potential sample mixups, construct 
neighbor-joining trees, and perform bootstrap analyses 
(1000 replicates). Sequence regions that could not 
be unambiguously aligned were removed from 
subsequent analyses. Each sequence was compared 
for phylogenetic relatedness to the entire set of 
published and available unpublished laboratory HIV 
database sequences. If after this analysis the viral 
sequences from a mother and an infant appeared as a 
monophyletic group on a phylogenetic tree, they 
were judged to be phylogenetically linked or to have 
a common ancestor not shared by sequences from 
any other individuals evaluated. Issues regarding the 
assignment of phylogenetic linkage are discussed in 
greater detail by Learn et al. 

12. L. M. Frenkel et al., at www.sciencemag.org/feature/data/974996.shl. 

13. R. Liu et al., Cell 86, 367 (1996). 

14. L. M. Frenkel et al., unpublished data. 

15. M.-L. Newell et al., Lancet 347, 213 (1996). 

16. P. Palumbo, J. Skurnick, D. Lewis, M. Eisenberg, J. 
Acquir. Immune Defic. Syndr. Hum. Retrovirol. 10, 
436 (1995). 

17. A. McMichael, R. Koup, A. J. Ammann, N. Engl. 
J. Med. 334, 801 (1996). 

18. E. C. Holmes et al., J. Infect. Dis. 167, 1411 (1993). 

19. T. Liu et al., J. Immunol. 154, 3147 (1995). 

20. A. Hoffenbach et al., ibid. 142, 452 (1989). 

21. G. Schochetman, S. Subbarao, M. L. Kalish, in Viral 
Genome Methods, K. W. Adolph, Ed. (CRC Press, 
Boca Raton, FL, 1996), pp. 25-41. 

22. E. L. Delwart, M. P. Busch, M. L. Kalish, J. W. Mosley, 
J. I. Mullins, AIDS Res. Hum. Retrovir. 11, 1181 
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23. C. H. Contag et al., J. Virol. 71, 1292 (1997). 
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Single-nucleotide polymorphisms (SNPs) are the most frequent type of variation in the 
human genome, and they provide powerful tools for a variety of medical genetic studies. 
In a large-scale survey for SNPs, 2.3 megabases of human genomic DNA was examined 
by a combination of gel-based sequencing and high-density variation-detection DNA 
chips. A total of 3241 candidate SNPs were identified. A genetic map was constructed 
showing the location of 2227 of these SNPs. Prototype genotyping chips were developed 
that allow simultaneous genotyping of 500 SNPs. The results provide a characterization 
of human diversity at the nucleotide level and demonstrate the feasibility of large-scale 
identification of human SNPs. 



Although the Human Genome Project still 
has tremendous work ahead to produce the 
first complete reference sequence of the 
human chromosomes, attention is already 
focusing on the challenge of large-scale 
characterization of the sequence variation 
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among individuals (1). This genetic 
diversity is of interest because it explains the 
basis of heritable variation in disease sus- 
ceptibility, as well as harbors a record of 
human migrations. 

The most common type of human genet- 
ic variation is the SNP, a position at which 
two alternative bases occur at appreciable 
frequency (>1%) in the human population. 
There has been growing recognition that 
large collections of mapped SNPs would 
provide a powerful tool for human genetic 
studies (1, 2). SNPs can serve as genetic 
markers for identifying disease genes by 
linkage studies in families, linkage 
disequilibrium in isolated populations, association 
analysis of patients and controls, and 
loss-of-heterozygosity studies in tumors (1, 2). 



Although individual SNPs are less informa- 
tive than currently used genetic markers 
(3), they are more abundant and have 
greater potential for automation (4, 5). 

We performed an initial survey to iden- 
tify SNPs by using conventional gel-based 
DNA sequencing to examine sequence- 
tagged sites (STSs) distributed across the 
human genome. STSs are short genomic 
sequences that can be amplified from DNA 
samples by means of a corresponding poly- 
merase chain reaction (PCR) assay. From 
among 24,568 STSs used in the construction 
of a physical map of the human genome 
at the Whitehead Institute for Biomedical 
Research/MIT Center for Genome 
Research (6, 7), an initial collection of 
1139 STSs was chosen (8). These STSs 
contained a total of 279 kb of genomic 
sequence (9), with one-third from random 
genomic sequence and two-thirds from 3'- 
ends of expressed sequence tags (3'-ESTs) 
and primarily representing untranslated re- 
gions of genes. Each STS was amplified 
from four samples (10): three individual 
samples and a pool of 10 individuals (there- 
by permitting allele frequencies to be esti- 
mated among 20 chromosomes). The PCR 
products were subjected to single-pass DNA 
sequencing based on fluorescent-dye prim- 
ers and gel electrophoresis; sequence traces 
were compared by a computer program fol- 
lowed by visual inspection (11). Candidate 
SNPs were declared when two alleles were 
seen among [...] alleles present at a frequency greater than 
[...] in the pooled sample. The term "candidate SNP" 
is used because a subset of such 
apparent polymorphisms turn out to be sequencing 
artifacts, as discussed below. 

The survey identified 279 candidate 
SNPs, distributed across 239 of the STSs. 
This corresponds to a rate of roughly one SNP per 
kilobase [...] and an observed 
nucleotide heterozygosity of [...] 
sequences 
(3'-ESTs) showed a lower polymorphism 
rate than random genomic sequence (with 
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seen in a test set of 39 individuals fell 
into distinct clusters, corresponding to the 
possible genotypes (28). These clusters 
could then be used to assign genotypes for 
further samples (29). 

The cluster test was applied to the ~500 
candidate SNPs that worked well under 
multiplex amplification conditions: 75% 
passed the cluster test, and careful rese- 
quencing demonstrated that all such loci 
were true polymorphisms. The cluster test 
thus provides reliable confirmation of an 
SNP. The remaining 25% failed the cluster 
test, and resequencing revealed that half 
were false positives in the SNP screen and 
half were true polymorphisms (with the 
poor discrimination on the chip typically 
due to one allele hybridizing more weakly 
than the other). Thus, 88% of the candi- 
date SNPs proved to be true polymor- 
phisms, and 86% of true SNPs passed the 
cluster test. 
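
The two summary percentages follow from the fractions just given. The sketch below works through that arithmetic; the function name is hypothetical, and it relies on the statement in the text that every locus passing the cluster test was a true polymorphism.

```python
def cluster_test_summary(pass_fraction: float, true_fraction_among_failures: float):
    """Return (fraction of candidates that are true SNPs,
               fraction of true SNPs that pass the cluster test).

    Assumes every candidate passing the cluster test is a true polymorphism,
    as reported in the text.
    """
    fail_fraction = 1.0 - pass_fraction
    true_snps = pass_fraction + fail_fraction * true_fraction_among_failures
    pass_rate_among_true = pass_fraction / true_snps
    return true_snps, pass_rate_among_true


true_snps, pass_rate = cluster_test_summary(0.75, 0.5)
print(f"true polymorphisms among candidates: {true_snps:.1%}")  # 87.5%
print(f"true SNPs passing the cluster test: {pass_rate:.1%}")   # 85.7%
```

These values round to the 88% and 86% quoted above.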

To test the reproducibility and accuracy 
of the genotyping method, we genotyped a 
set of 91 loci (passing the cluster test) in 
three individuals by performing chip-based 



genotyping on six separate occasions over a U The resources reported here represent 
2-month period. The correct genotypes , only a first step toward a dense SNP map of 
independently determined by thor- * the human genome. The genetic map 
— e - 0 el-based resequencing. The genotyp- £ should already be useful for family-based \f 
ing-chip assay assigned a genotype in 98% ^ linkage studies, given the average spacing 



of cases (1613/1638), and this assignment^ 
proved correct in 99.9% (1611/1613) of £ 
these cases. The loci were also genotyped in 
two complete CEPH families. The geno- 
types were not independently confirmed, 
but they were fully consistent with mende- 
lian segregation. 

For SNPs passing the cluster test, highly 
accurate genotypes could thus be obtained 
with the simple design used here. For the 
remaining SNPs (14%), similar accuracy 
can likely be obtained but may require op- 
timization of the genotyping array design, 
depending on the locus [as shown in (5)]. 
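
The reproducibility figures reported for the 91-locus test can be recomputed from the counts given in the text: 91 loci, three individuals, and six occasions give the 1638 genotyping attempts. This is a minimal check of that arithmetic; the variable names are illustrative.

```python
loci, individuals, occasions = 91, 3, 6
attempted = loci * individuals * occasions  # total genotype calls attempted

called = 1613   # calls for which the chip assigned a genotype
correct = 1611  # assigned genotypes that agreed with resequencing

call_rate = called / attempted
accuracy = correct / called

print(attempted)                         # 1638
print(f"call rate: {call_rate:.1%}")     # 98.5%
print(f"accuracy:  {accuracy:.1%}")      # 99.9%
```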

The SNP surveys provide data about 
human genetic diversity. Two classical mea- 
sures of diversity (30) are H, the average 
heterozygosity per nucleotide, and K, the 
proportion of sites harboring a variation. H 
does not depend on sample size, whereas K 
increases with the number of genomes sur- 
veyed. For a population at equilibrium, the 
neutral theory of evolution relates H and K 
to the classical population genetic parameter 
$\theta = 4N_e\mu$, where $N_e$ is the effective 
population size and $\mu$ is the mutation rate 
per nucleotide. ($\theta$ can be thought of as 
twice the number of new mutations per 
generation arising in a population with size 
$N_e$.) Specifically, $H \approx \theta$ and 
$K \approx \theta\,[1^{-1} + 2^{-1} + 3^{-1} + \cdots + (n-1)^{-1}]$, provided 
that $\theta$ is small. From these equations, one 
can estimate $\theta$ based on H or K. 
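
The relation between K and the population parameter can be inverted numerically. The sketch below assumes that n in the bracketed sum is the number of genomes (chromosomes) surveyed, which the text implies but does not state; the function names and the example values are hypothetical.

```python
def harmonic_sum(n_genomes: int) -> float:
    """1/1 + 1/2 + ... + 1/(n-1), the bracketed sum in the expression for K."""
    return sum(1.0 / i for i in range(1, n_genomes))


def expected_K(theta: float, n_genomes: int) -> float:
    """Expected proportion of variable sites for a given theta (small-theta limit)."""
    return theta * harmonic_sum(n_genomes)


def theta_from_K(K: float, n_genomes: int) -> float:
    """Estimate theta from the observed proportion of variable sites."""
    return K / harmonic_sum(n_genomes)


# Hypothetical illustration: K grows with sample size while theta stays fixed.
theta = 5e-4
for n in (2, 6, 20):
    K = expected_K(theta, n)
    print(n, f"{K:.2e}", f"{theta_from_K(K, n):.2e}")
```

For n = 2 the sum is 1, so K and H coincide, which is consistent with H being the per-nucleotide heterozygosity of a pair of genomes.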

The human population is not at equi- 



librium, but rather underwent a rapid pop- 
ulation expansion in the last 100,000 to 
200,000 years. Such population explosions 
tend to suppress the effects of genetic drift 
and thus preserve the distribution of common 
alleles and the value of $\theta$. Accordingly, 
the value of $\theta$ is relevant to the 
ancestral human population before its re- 
cent expansion. 

The four estimates of $\theta$ derived from H 
and K for the two surveys are all roughly 
$\theta \approx 4 \times 10^{-4}$ (Table 1). Assuming a mutation 
frequency of $\mu \approx 10^{-8}$ to $10^{-9}$, this 
would suggest an effective population size of 
$N_e \approx 10^4$ to $10^5$, which seems reasonable 
for the ancestral population preceding the 
explosion in the last 100,000 years (31). 
Strictly speaking, these estimates apply only 
to the European population, from which all 
samples were drawn. However, a prelimi- 
nary survey of a more diverse sample of 31 
individuals representing all major racial 
groups yielded a value of $\theta$ that is only 30% 
larger (26), consistent with the idea that 
human variation occurs primarily within 
rather than between racial groups (32). 
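
The effective population size quoted above comes from rearranging the definition of the population parameter as four times the effective population size times the mutation rate. A minimal sketch of that rearrangement, using the values stated in the text (the function name is illustrative):

```python
def effective_population_size(theta: float, mu: float) -> float:
    """Solve theta = 4 * N_e * mu for N_e."""
    return theta / (4.0 * mu)


theta = 4e-4
for mu in (1e-8, 1e-9):
    n_e = effective_population_size(theta, mu)
    print(f"mu = {mu:g}: N_e is about {n_e:,.0f}")
```

This gives roughly 10,000 for the higher mutation rate and 100,000 for the lower one.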



Saiki et al., ibid., p. 6230; A.-C. Syvanen et al., 
Genomics 8, 684 (1990); D. A. Nickerson et al., Proc. 
Natl. Acad. Sci. U.S.A. 87, 8923 (1990); K. J. Livak et 
al., Nature Genet. 9, 341 (1995); M. T. Roskey et al., 
Proc. Natl. Acad. Sci. U.S.A. 93, 4724 (1996). 

5. M. T. Cronin et al., Hum. Mutat. 7, 244 (1996). 

6. T. J. Hudson et al., Science 270, 1945 (1995). 

7. G. D. Schuler et al., ibid. 274, 540 (1996). 

8. STSs with the largest sizes were used in the gel- 
based screen, and the remaining STSs, having 
somewhat smaller sizes, were used in the subse- 
quent chip-based screen. 

9. The genomic sequence screened (279 kb) is the sum 
of the distances between the primer sites of the STSs 
successfully resequenced. 

10. The individuals surveyed were chosen from Centre 
d'Etude du Polymorphisme Humain (CEPH) pedigrees 
K104, K884, and K1331 from the Amish, Venezuelan, 
and Utah populations, respectively. The 
SNP survey by gel-based sequencing examined 
three unrelated individuals (K104-1, K884-2, K1331-1) 
and a pool of 10 individuals (K104-13, -14, -15, 
-16; K884-15, -16; K1331-12, -13, -14, -15). The 
SNP survey by chip-based analysis examined seven 
unrelated individuals (K104-1, -16; K884-2, -15, -16; 
K1331-12, -13). 

1 1 . STSs were amplified with their corresponding PCR 
primers as described (6), except that the forward 
primer was modified to include the M13 -21 primer 
site (5' -TGTAAAACGACGGCCAGT-3') at its 5'-end. 
The resulting PCR products were subjected to 
dye-primer sequencing (33), with products detected on an 
ABI377 or ABI373 fluorescence sequence detector. 
Possible sequence variations were detected by the 
ABI Sequence Navigator software package, which 
suggests potential heterozygotes by identifying nucle- 
otide positions at which a secondary peak exceeds a 
selected threshold (50%). Such apparent variations 
were then visually inspected to compare the patterns 
seen among the several individuals. 



12. [...] Cooper and M. Krawczak, Hum. Genet. 85, 55 [...] 

(2 cM) and average heterozygosity (34%) of 
the markers. (The heterozygosity applies to 
the European-derived samples studied here, 
but a preliminary survey of ~180 of the 
SNPs shows that most are also polymorphic 
in other groups.) It still remains to develop 
a suitable genotyping system, such as a 
2000-SNP genotyping chip. 

Large-scale screening for human varia- 
tion is clearly feasible. Someday it may be- 
come possible to screen entire human ge- 
nomes. In the nearer term, a key goal will 
be to extend SNP discovery to the protein 
coding regions of all human genes (roughly 
120 Mb of sequence, only about 40 times 
more than the current study) in order to 
catalog the common variants that may ex- 
plain susceptibility to common genetic 
traits and diseases (1). 
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High density synthetic oligonucleotide arrays 
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Biological systems read, store and modify genetic information 
using the rules of molecular recognition. Every nucleic acid strand 
carries the capacity to recognize complementary sequences 
through base pairing. The process of recognition, or hybridization, 
can be highly parallel; every sequence in a complex mixture can, 
in principle, be interrogated simultaneously. We have used these 
simple principles to develop powerful new experimental tools 
designed to collect and analyse vast amounts of genetic and cellu- 
lar information. The introduction, development and integration 
of two key technologies (refs 1-5) form the cornerstone of the new 
methods. The first is the fabrication of hundreds of thousands of 
polynucleotides at high spatial resolution in precise locations on a 
surface. The second, laser confocal fluorescence scanning, facili- 
tates the measurement of molecular binding events on the array. 
These technologies and some variants have been adopted in both 
the commercial and academic sectors (see pages 25 (ref. 6), 10 
(ref. 7) and 15 (ref. 8) of this issue). 

At Affymetrix, we have focused on light-directed synthesis for 
the construction of high-density DNA probe arrays using two 
techniques: photolithography and solid-phase DNA synthesis. We 
attach synthetic linkers modified with photochemically remov- 
able protecting groups to a glass substrate and direct light through 
a photolithographic mask to specific areas on the surface to pro- 
duce localized photodeprotection (Fig. 1). The first of a series of 
chemical building blocks, hydroxyl-protected deoxynucleosides, 
is incubated with the surface, and chemical coupling occurs at 
those sites that have been illuminated in the preceding step. Next, 
light is directed to different regions of the substrate by a new 
mask, and the chemical cycle is repeated (refs 9, 10). Highly efficient 
strategies can be used to synthesize arbitrary polynucleotides at 
specified locations on the array in a minimum number of 
chemical steps (ref. 1). For example, the complete set of 4^N 
polydeoxynucleotides of length N, or any subset, can be synthesized in only 
4 × N cycles. Thus, given a reference sequence, a DNA probe array 
can be designed that consists of a highly dense collection of com- 
plementary probes with virtually no constraints on design para- 
meters. The amount of nucleic acid information encoded on the 
array in the form of different probes is limited only by the physical 
size of the array and the achievable lithographic resolution. 
Current large scale commercial manufacturing methods allow for 
approximately 300,000 polydeoxynucleotides to be synthesized 
on small 1.28 × 1.28 cm arrays; experimental versions now 
exceed one million probes per array. 
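
The claim that all 4^N probes of length N can be made in 4 × N cycles can be illustrated with a small simulation: at each position, one mask per base exposes exactly those sites whose target probe carries that base there. This is a simplified sketch, not the manufacturing procedure; the function names are hypothetical, and the direction of chemical synthesis and mask layout on the chip are ignored.

```python
from itertools import product

BASES = "ACGT"


def all_probes(n):
    """Every sequence of length n over the four bases."""
    return ["".join(p) for p in product(BASES, repeat=n)]


def synthesize(probes):
    """Grow all probes in parallel, one (position, base) coupling cycle at a time.

    Returns the sequences built at each site and the number of cycles used.
    """
    length = len(probes[0])
    grown = [""] * len(probes)
    cycles = 0
    for pos in range(length):
        for base in BASES:
            # Mask: True where this cycle's light deprotects the site.
            mask = [probe[pos] == base for probe in probes]
            grown = [g + base if lit else g for g, lit in zip(grown, mask)]
            cycles += 1
    return grown, cycles


probes = all_probes(3)
grown, cycles = synthesize(probes)
print(len(probes), cycles, grown == probes)  # 64 12 True
```

Because the mask for each cycle is computed from the target list, the same loop also covers the "any subset" case: passing a shorter list of equal-length probes still uses at most 4 × N cycles.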



Photolithography allows the construction of arrays with 
extremely high information content. Because the arrays are 
constructed on a rigid material (glass), they can be inverted and 
mounted in a temperature-controlled hybridization chamber. 
A fluorescently tagged nucleic acid sample injected into the 
chamber hybridizes to complementary oligonucleotides on the 
array. Laser excitation enters through the back of the glass sup- 
port, focused at the interface of the array surface and the target 
solution. Fluorescence emission is collected by a lens and passes 
through a series of optical filters to a sensitive detector. By simply 
scanning the laser beam or translating the array, or a combina- 
tion of both, a quantitative two-dimensional fluorescence image 
of hybridization intensity is quickly obtained (refs 1, 2). 

Gene expression monitoring 

Once sequence information (partial or complete) for a gene is 
obtained, the next question is generally: "what does its product 
do?" To understand gene function, it is helpful to know when 
and where it is expressed, and under what circumstances the 
expression level is affected. Beyond questions of individual gene 
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Fig. 1 a. Light directed oligonucleotide synthesis. A solid support is derivatized 
with a covalent linker molecule terminated with a photolabile protecting 
group. Light is directed through a mask to deprotect and activate selected 
sites, and protected nucleotides couple to the activated sites. The process is 
repeated, activating different sets of sites and coupling different bases allow- 
ing arbitrary DNA probes to be constructed at each site. b. Schematic represen- 
tation of the lamp, mask and array. 
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Exploring the new world of the genome 
with DNA microarrays 

The genome project has revitalized exploration in biological 
research. Not long ago, it was possible for biologists to imagine 
that the genes that had been discovered via mutations, selections 
and cloning schemes represented a good approximation of the 
total universe of genes, and that the proteins already discovered 
on the basis of their abundance, location, or activity well 
represented the total universe of proteins. One of the great 
contributions of the genome project has been to show us what a small 
part of this world was really known to us, and how much of this 
world remains to be explored. In April 1996, the complete 
sequence of the yeast genome confronted us with the fact that 
yeast contain approximately 6,200 'real' genes, as judged from 
open reading frames, for only one quarter of which could we 
hazard a guess regarding function 1 
(http://www.ncbi.nlm.nih.gov/Entrez/Genome/org.html). The tens of 
thousands of partial human cDNA sequences representing previously 
unseen genes have had a similar humbling effect 2. Although we may 
have suspected its existence, the actual discovery of this genetic 
terra incognita has jolted biology much as the discovery of America 
jolted Europe 500 years ago, showing us how much of the world is 
beyond the frontier: mysterious, tantalizing and unexplored. 

Exploring the genome and the natural world 
with DNA microarrays 

Exploration means looking around, observing, describing and 
mapping undiscovered territory, not testing theories or models. 
The goal is to discover things we neither knew nor expected, and to 
see relationships and connections among the elements, whether 
previously suspected or not. It follows that this process is not 
driven by hypothesis and should be as model-independent as 
possible (see page 54 of this issue (ref. 3)). We should use the 
unprecedented experimental opportunities that the genome 
sequences provide to take a fresh, comprehensive and open-minded 
look at every question in biology. If we succeed, we can 
expect that many of the new models that emerge will defy 
conventional wisdom. 

Exploring and surveying are best done systematically. The 
genome, representing the complete blueprint of the organism, 
is the natural bounded system in which to conduct this exploration. 
The completion of the genomic sequences of model 
organisms (currently the eukaryotes Saccharomyces cerevisiae and 
Caenorhabditis elegans, as well as dozens of bacterial species) 
provides us with such complete blueprints 
(http://www.ncbi.nlm.nih.gov/Entrez/Genome/org.html). These genome 
sequences have not only made a new era of exploration imperative, but, 
providentially, they have also made it possible. 

DNA microarrays provide a simple and natural vehicle for 
exploring the genome in a way that is both systematic and 
comprehensive 4-10. The power and universality of DNA microarrays 
as experimental tools derives from the exquisite specificity and 
affinity of complementary base-pairing. We are provided thereby 
with an instant experimental handle on DNA or RNA unlike any 
we possess for any other biological molecules. A DNA copy of an 
individual gene provides a nearly ideal reagent for specific and 
quantitative detection and measurement of the sequence of the 
gene, even in an extremely complex mixture. For this reason, the 
sequence information provided by the genome project has had 
an instantaneous impact on experimental biology. 

The method used in our labs is simple to describe (complete 
details and protocols are available, http://cmgm.stanford.edu/pbrown). 
Briefly, arrays of thousands of discrete DNA sequences 
(for example, all of the known and predicted [illegible in source] of this 
issue). To compare the relative abundance of each of these gene 
sequences in two DNA or RNA samples (for example, the total 
mRNA isolated from two different cell populations; Fig. 1), the 
two samples are first labelled using different fluorescent dyes (say, a 
red dye and a green dye). They are then mixed and hybridized with 
the arrayed DNA spots. Use of differentially labelled mixtures 
avoids most of the complications of hybridization kinetics; we 
always measure the ratio. After hybridization, fluorescence 
measurements are made with a microscope that illuminates each DNA 
spot and measures fluorescence for each dye separately; these 
measurements are used to determine the ratio, and in turn the relative 
abundance, of the sequence of each specific gene in the two mRNA 
or DNA samples. There are, of course, other microarray systems 
and methods, most notably the oligonucleotide arrays developed 
by Affymetrix 6,7,13, which differ in many details but share the 
essential simplicity of this experimental design. 
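
The two-dye ratio measurement described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the software used by the authors: the function name `log2_ratios`, the constant background subtraction, and the intensity values are all hypothetical.

```python
import math

def log2_ratios(red, green, background=0.0):
    """Return log2(red/green) for each spot after background subtraction.

    red, green: fluorescence intensities for the two dyes, one value per spot.
    Spots whose corrected signal is not positive in both channels give None.
    """
    ratios = []
    for r, g in zip(red, green):
        r_corr = r - background
        g_corr = g - background
        if r_corr <= 0 or g_corr <= 0:
            ratios.append(None)  # signal indistinguishable from background
        else:
            ratios.append(math.log2(r_corr / g_corr))
    return ratios

# Hypothetical intensities for three spots (arbitrary units).
red = [4100.0, 1100.0, 350.0]
green = [1100.0, 1100.0, 2100.0]
ratios = log2_ratios(red, green, background=100.0)
```

Working with the logarithm of the ratio is a common convention because a fourfold increase and a fourfold decrease then sit symmetrically about zero; the text itself only requires the ratio.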
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Gene chips: Array of hope for understanding gene regulation 
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High density arrays of DNA fragments on a solid surface 
allow the expression of thousands of genes to be 
assessed in a single experiment. The development of 
this 'gene chip' technique heralds a new era of studies 
that promises to provide an integrated view of the 
expression of all genes of an organism. 

The realization almost a half century ago that genes that 
are co-regulated often encode proteins of related function 
fueled a remarkable period of discovery about mechanisms 
of gene regulation. The paradigms provided by 
this work form a cornerstone of molecular biology [1]. 
These paradigms were verified and extended by 
painstaking dissection of regulatory mechanisms, operon 
by operon (for eukaryotes, regulon by regulon). While 
'global' regulatory mechanisms that act upon large sets of 
genes were recognized early on [2], their analysis has 
been, for the most part, limited to a small number of 
representative genes subject to global control. Our present 
knowledge of how genes are regulated thus stems from 
analysis of a limited number of genes. A revolutionary 
new technology for measuring expression of all genes of 
an organism in a single experiment has now been 
devised [3-5]. This may herald a new era of investigation 
of gene regulation that promises to provide a much 
deeper understanding of how cells coordinate expression 
of thousands of genes. 



This advance was made possible by the development of 
technology that allows DNA fragments to be arrayed at 
high density on a solid support for use in hybridization 
experiments [6,7]. Thousands of DNA fragments can be 
arrayed on a surface no larger than a fingernail and used to 
probe the mRNA content of cells. Thus, whole genomes 
can be assessed for their pattern of gene expression, 
enabling us, for the first time, to view gene regulation in 
the context of all the complex networks of pathways that 
operate in cells. We can now identify all the genes of an 
organism that change expression under a given condition, 
and hope to make sense of the cell's response to that 
condition. Such information can provide key clues to the 
function of individual proteins. Moreover, the ability to 
acquire data of this kind is a big step toward achieving the 
ultimate goal of molecular biology: a complete under- 
standing of cellular function. 



Two different methods for arraying large numbers of 
DNA molecules in a very small space have been developed. 
In one, cDNA sized fragments, usually produced 
by the polymerase chain reaction (PCR), are spotted 
onto [illegible]-coated glass slides [6]. In the other, short 
(~25 nucleotide) oligonucleotides are synthesized [illegible]. 
[illegible] have been called 'chips', but this moniker fits the 
oligonucleotide arrays better, because they are [illegible] 
computer chips. Both [illegible] into a very small area: the 
current oligonucleotide chips display all 6000 yeast genes 
[illegible] chips; the DNA fragment microarrays fit 
the same information onto a single 1.8 x 1.8 cm glass slide 
(see Figure 1). On the oligonucleotide chips, each gene 
must be represented by several (typically 20) different 
oligonucleotides, because of the differences in hybridization 
properties and reduced hybridization specificity 
inherent in such short probes. In addition, each 
oligonucleotide on the chips has a partner adjacent to it that 
differs at just one central base, which serves as an internal 
control for hybridization specificity. Each gene thus 
encompasses about 40 'features' (a feature being an area 
of the glass surface occupied by DNA molecules of one 
sequence) on an oligonucleotide chip, whereas it takes 
up only one feature on a DNA fragment microarray. 
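
The feature arithmetic in the paragraph above can be checked with a short Python sketch. The function names are hypothetical, and averaging the perfect-match minus mismatch differences is only one plausible way of using the single-base control, assumed here for illustration; the text does not say how the control signal is combined.

```python
def features_per_gene(probes_per_gene, has_mismatch_partner=True):
    """Each probe occupies one feature; a mismatch partner doubles the count."""
    return probes_per_gene * (2 if has_mismatch_partner else 1)

def mean_pm_minus_mm(perfect_match, mismatch):
    """Average perfect-match minus mismatch signal over a gene's probe pairs.

    This scoring rule is an assumption made for the example.
    """
    diffs = [pm - mm for pm, mm in zip(perfect_match, mismatch)]
    return sum(diffs) / len(diffs)

oligo_features = features_per_gene(20)            # 20 probes plus 20 partners
fragment_features = features_per_gene(1, False)   # one spot per gene
total_oligo_features = 6000 * oligo_features      # for roughly 6000 yeast genes
```

With the figures quoted in the text, an oligonucleotide representation of the yeast genome needs on the order of 240,000 features, against about 6000 spots on a DNA fragment microarray.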

Figure 1 

A DNA fragment array of all ~6000 yeast genes probed with labeled 
cDNA made from galactose- and glucose-grown cells. Each spot 
(element) on the array contains a cDNA-sized DNA fragment 
representing one yeast coding sequence. mRNA from galactose-grown 
cells was converted to red-labeled cDNA (using dUTP labeled 
with the fluorescent dye Cy3); mRNA from glucose-grown cells was 
converted to green-labeled cDNA (with the dye Cy5). These two 
preparations of labeled cDNA were mixed and used to probe the array. 
Red spots bind only galactose-grown cDNA, and thus represent genes 
expressed only in galactose-grown cells; green spots bind only cDNA 
from glucose-grown cells, and therefore represent genes expressed 
only in glucose-grown cells. Spots containing genes expressed under 
both conditions hybridize to both cDNAs, and thus appear yellow. The 
intensity of the color of each spot (from red to green) reveals the 
relative expression level of genes under the two conditions. (Figure 
courtesy of Joe DeRisi, Vishy Iyer, and Pat Brown; for more of these 
images, see [17].) 

The DNA arrays are used to interrogate complex mixtures 
of nucleic acids, and thus are similar to the 'dot blots' that 
have been in use for a long time [8,9]. They differ from 
dot blots in the nature of the labelled species that serves 
as the probe: in dot blots, the complex mixture of 
mRNAs is fixed to the solid surface and probed with a 
single labelled DNA fragment; in the DNA microarrays, 
individual unlabelled DNA fragments are fixed on the 
solid support and probed with a complex mixture of 
labelled cDNAs or mRNAs. The major advance of the 
arrays over the older technology is a significant increase in 
sensitivity, primarily as a result of two factors. Because the 
labeled probe is usually the limiting component in nucleic 
acid hybridization, probably the more important factor is 
the small area occupied by the arrays, which significantly 
reduces the volume of the hybridization solution (from 
milliliters to microliters) and thereby greatly increases 
the concentration of the probe. Because of the small area 
they occupy, sophisticated lasers and sensitive detection 
systems are required to measure the hybridization signals. 
The second factor is that the glass surface of an array 
generates a smaller background hybridization signal than the 
porous membranes used for dot blots. Both kinds of 
microarray thus permit very sensitive detection of gene 
expression: currently, an mRNA present at a level less 
than one molecule in 100,000 can be detected, equivalent 
to a transcript present at only one copy per 20 yeast cells! 
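
The effect of hybridization volume on probe concentration can be made concrete with a small calculation. The sketch below is illustrative only: the probe amount and both volumes are assumed values chosen to represent "milliliters" and "microliters", not figures from the article.

```python
def probe_concentration_nM(probe_pmol, volume_uL):
    """Concentration in nM of probe_pmol picomoles dissolved in volume_uL microliters."""
    # 1 pmol per microliter equals 1 micromolar, i.e. 1000 nM.
    return probe_pmol * 1000.0 / volume_uL

# Hypothetical numbers: the same 1 pmol of labeled probe in two volumes.
dot_blot_volume_uL = 5000.0   # 5 mL for a membrane hybridization (assumed)
array_volume_uL = 10.0        # 10 microliters on a glass array (assumed)

dot_blot_nM = probe_concentration_nM(1.0, dot_blot_volume_uL)
array_nM = probe_concentration_nM(1.0, array_volume_uL)
fold_gain = dot_blot_volume_uL / array_volume_uL
```

For a fixed amount of labeled material the concentration scales inversely with volume, so under these assumed volumes the array hybridization is 500 times more concentrated.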

The DNA fragment microarrays can be produced by 
anybody with the ability and modest means required to 
assemble the equipment to print the arrays [10]. Produc- 
tion of the DNA fragments to be arrayed does, however, 
require a large number of oligonucleotides for the PCR, 
which can be prohibitively expensive, and generation of 
the PCR products is labor intensive. (For yeast, much of 
this work has already been done [11].) A limitation of the 
oligonucleotide chips is that knowledge of the DNA 
sequences to be studied is necessary to produce them, 
whereas random cDNA clones can be used in the DNA 
fragment microarrays. Also, dependence on commercial 



sources for the oligonucleotide chips may present limita- 
tions of availability and affordability. Both methods require 
fairly sophisticated microscopy and software for detecting, 
measuring and identifying hybridization signals from the 
arrays. This technology currently seems out of the reach of 
the average lab, but commercial services are sprouting to 
provide the microarrays and equipment necessary to make 
this technology widely accessible. In the meantime, the 
whole genome dot blots that have recently become avail- 
able, at least for yeast, may fulfil the needs of most labs 
that want to perform these kinds of experiments [12]. 

The utility of the two kinds of microarray for measuring 
expression of a large number of genes was established pre- 
viously [6,13-15], but was spectacularly demonstrated 
recently by two groups who used them to measure expres- 
sion of all 6000 genes of the bakers' yeast, Saccharomyces 
cerevisiae, grown under a few different conditions [3,4]. 
Wodicka et al. [4] compared gene expression in yeast cells 
grown on rich and minimal media. They isolated polyA+ 
RNA from cells grown under the two conditions, converted 
it into cDNA flanked by a promoter for T7 RNA 
polymerase, and copied it into antisense, biotin-labeled 
RNA by transcription in vitro. This final step amplifies the 
mRNA probe, apparently without introducing significant 
bias. Labeled RNA made in this way from cells grown 
under the two conditions was used to probe the oligonu- 
cleotide chip, and the bound RNA was detected and 
quantified using streptavidin conjugated to a fluorescent 
dye, yielding highly reproducible results. More than 87% 
of yeast mRNAs were detected, with a dynamic range of 
about three orders of magnitude. 

Similar results were obtained by DeRisi et al. [3], who 
used the DNA fragment microarrays to measure gene 
expression in yeast cells as they run out of glucose. They 
isolated polyA+ RNA from a culture of cells at several different 
times after inoculation into glucose media, fluorescently 
labeled it by reverse transcription, and used the 
labelled product to probe DNA fragment microarrays. The 
two types of array seem roughly comparable in their sensi- 
tivity, range and reproducibility. The oligonucleotide 
chips may be better at measuring relative expression dif- 
ferences, because they easily revealed more than 50-fold 
differences in expression, whereas the maximum expres- 
sion difference measured with the DNA fragment 
microarrays was 20-fold (although it is difficult to compare 
the results of the two experiments, as they employed very 
different growth conditions). 

The results presented by DeRisi et al. [3] and Wodicka et 
al. [4] mostly serve to validate the experimental approach, 
but in a very satisfying way, as many of the changes in 
gene expression that were observed were expected. 
DeRisi et al. [3], for example, rediscovered the fact that, 
when yeast cells run out of glucose, the expression of 
genes for oxidative metabolism and gluconeogenesis 
increases, and the expression of genes for fermentation 
and protein synthesis decreases. That these results 
conform almost perfectly to what is known about regula- 
tion of these well-studied genes lends great confidence to 
the technique. 

Similarly, many of the genes expected to have higher 
levels of expression in cells grown on minimal media than 
on rich media — such as those involved in nitrogen acqui- 
sition or amino acid synthesis — were identified with the 
oligonucleotide chips, as were many genes that have the 
converse expression pattern, such as those involved in 
amino-acid transport. The technique is not perfect, 
however, as the DNA fragment microarrays missed several 
genes whose expression is known to be regulated by 
glucose — for example, HXT1, which is induced about 
300-fold by glucose [16], and GAL4, which is about 75% 
repressed by glucose [17]. (These omissions could be 
easily uncovered, because all the results are publicly 
available in a terrific, searchable database [18].) Nevertheless, 
the microarrays work better than most of us imagined they 
would, and provide a wonderful tool that greatly expands 
our horizons. 



What have these experiments taught us about cellular 
function? They revealed that almost 90% of yeast genes 
are expressed, most at very low levels (69% with one or 
fewer mRNAs per cell) [4], but this has long been known 
from the classic work of Hereford and Rosbash [19]. 
Similarly, many of the genes DeRisi et al. [3] found to be 
regulated by glucose have long been known to be subject to 
such regulation. A substantial number of genes, however, 
were found for the first time to be regulated in these two 
studies, and nothing is known about a significant proportion 
of these. The regulatory patterns of these proteins 
thus provide a first clue to their function. These results 
also allow genes to be grouped by their expression pattern, 
as was done insightfully by DeRisi et al. [3]. The function 
of at least some of the genes in a group is usually known, 
allowing inferences to be made about the possible function 
of the other genes in the same group. Clearly, this 
technology will speed the pace of discovery of protein and 
cellular function. 

One of the most promising applications of DNA microarrays 
is the identification of all the genes whose expression 
changes when a gene is inactivated. This is a boon for 
those interested in transcription factors, as this information 
should help reveal their role in cellular physiology, 
and might even speak to their mechanism of action. 
DeRisi et al. [3] identified all yeast genes whose expression 
changes when the Tup1 transcription factor is inactivated 
by mutation. The expression of many genes 
increased significantly as a result of deletion of TUP1, 
which would probably lead one to conclude correctly that 
Tup1 is a general repressor. Interestingly, expression of a 
few genes decreased significantly in a tup1 mutant, suggesting 
that Tup1 may also activate transcription in 
certain cases. In a separate set of experiments, genes 
whose expression changes when the Yap1 transcription 
factor is overexpressed were identified. This revealed a 
set of genes whose expression increased, indicating that 
Yap1 is a transcriptional activator. Again, expression of a 
few genes decreased significantly upon Yap1 overexpression, 
suggesting that Yap1 may also be a repressor. 

A major problem with interpretation of these results is the 
difficulty in ascribing them to direct action of the 
transcription factor that is inactivated. In fact, it seems a 
good bet that indirect effects account for the unexpected 
responses to Tup1 absence and Yap1 overexpression. 
Nevertheless, the wealth of data provided by the microarrays 
allows the formulation of hypotheses that can be 
tested with other, more conventional experiments. The 
practical uses of this technology to identify candidate 
compounds for drug development are obvious. Furthermore, 
the microarrays are sure soon to be in wide clinical 
use, where they will undoubtedly aid in disease diagnosis 
and treatment. 



Now that the DNA microarrays are clearly working well 
for the analysis of gene expression, the major challenge is 
to handle and interpret the massive amounts of data that 
will quickly accrue. Just from the two reports of DeRisi et 
al. [3] and Wodicka et al. [4], there is a rich vein of infor- 
mation waiting to be mined that is sure to grow as this 
technology becomes widely available. But the problem we 
are faced with is a pleasant one: we are not limited by the 
amount of data we can collect, but by our ability to inter- 
pret it. If we are able to do so successfully, great insight 
into cellular function is promised. It is unlikely to change 
our paradigms, but it will take us one large step closer to 
the goal of a complete understanding of how cells work. 

Exploring the Metabolic and Genetic Control of 
Gene Expression on a Genomic Scale 

Joseph L. DeRisi, Vishwanath R. Iyer, Patrick O. Brown* 

DNA microarrays containing virtually every gene of Saccharomyces cerevisiae were used 
to carry out a comprehensive investigation of the temporal program of gene expression 
accompanying the metabolic shift from fermentation to respiration. The expression 
profiles observed for genes with known metabolic functions pointed to features of the 
metabolic reprogramming that occur during the diauxic shift, and the expression patterns 
of many previously uncharacterized genes provided clues to their possible functions. The 
same DNA microarrays were also used to identify genes whose expression was affected 
by deletion of the transcriptional co-repressor TUP1 or overexpression of the transcrip- 
tional activator YAP1. These results demonstrate the feasibility and utility of this ap- 
proach to genomewide exploration of gene expression patterns. 



The complete sequences of nearly a dozen 
microbial genomes are known, and in the 
next several years we expect to know the 
complete genome sequences of several 
metazoans, including the human genome. 
Defining the role of each gene in these 
genomes will be a formidable task, and un- 
derstanding how the genome functions as a 
whole in the complex natural history of a 
living organism presents an even greater 
challenge. 

Knowing when and where a gene is 
expressed often provides a strong clue as to 
its biological role. Conversely, the pattern 
of genes expressed in a cell can provide 
detailed information about its state. Al- 
though regulation of protein abundance in 
a cell is by no means accomplished solely 
by regulation of mRNA, virtually all dif- 
ferences in cell type or state are correlated 
with changes in the mRNA levels of many 
genes. This is fortuitous because the only 
specific reagent required to measure the 
abundance of the mRNA for a specific 
gene is a cDNA sequence. DNA microarrays, 
consisting of thousands of individual 
gene sequences printed in a high-density 
[...] 
favorable organism in which to conduct a 
systematic investigation of gene expression. 
The genes are easy to recognize in the ge- 
nome sequence, cis regulatory elements are 
generally compact and close to the tran- 
scription units, much is already known 
about its genetic regulatory mechanisms, 
and a powerful set of tools is available for its 
analysis. 

A recurring cycle in the natural history 
of yeast involves a shift from anaerobic 
(fermentation) to aerobic (respiration) me- 
tabolism. Inoculation of yeast into a medi- 
um rich in sugar is followed by rapid growth 
fueled by fermentation, with the production 
of ethanol. When the fermentable sugar is 
exhausted, the yeast cells turn to ethanol as 
a carbon source for aerobic growth. This 
switch from anaerobic growth to aerobic 
respiration upon depletion of glucose, re- 
ferred to as the diauxic shift, is correlated 
with widespread changes in the expression 
of genes involved in fundamental cellular 
processes such as carbon metabolism, pro- 
tein synthesis, and carbohydrate storage 
(7). We used DNA microarrays to charac- 
terize the changes in gene expression that 
take place during this process for nearly the 
entire genome, and to investigate the ge- 
netic circuitry that regulates and executes 
this program. 

Yeast open reading frames (ORFs) were 
amplified by the polymerase chain reaction 
(PCR), with a commercially available set of 
primer pairs (8). DNA microarrays, con- 
taining approximately 6400 distinct DNA 
sequences, were printed onto glass slides by 



using a simple robotic printing device (9). 
Cells from an exponentially growing culture 
of yeast were inoculated into fresh medium 
and grown at 30°C for 21 hours. After an 
initial 9 hours of growth, samples were harvested 
at seven successive 2-hour intervals, 
and mRNA was isolated (10). Fluorescently 
labeled cDNA was prepared by reverse transcription 
in the presence of Cy3(green)- 
or Cy5(red)-labeled deoxyuridine triphosphate 
(dUTP) (11) and then hybridized to 
the microarrays (12). To maximize the re- 
liability with which changes in expression 
levels could be discerned, we labeled cDNA 
prepared from cells at each successive time 
point with Cy5, then mixed it with a Cy3- 
labeled "reference" cDNA sample prepared 
from cells harvested at the first interval 
after inoculation. In this experimental de- 
sign, the relative fluorescence intensity 
measured for the Cy3 and Cy5 fluors at 
each array element provides a reliable mea- 
sure of the relative abundance of the corre- 
sponding mRNA in the two cell popula- 
tions (Fig. 1). Data from the series of seven 
samples (Fig. 2), consisting of more than 
43,000 expression-ratio measurements, 
were organized into a database to facilitate 
efficient exploration and analysis of the 
results. This database is publicly available 
on the Internet (13). 
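
As a rough check on the sampling scheme and data volume described above, the harvest times and an upper bound on the number of ratio measurements can be computed directly. This is a sketch under one assumption: that the seven samples fall at 9, 11, ..., 21 hours, which is how the "initial 9 hours" and "2-hour intervals" are read here. The variable names are hypothetical.

```python
# Sampling scheme as read from the text (assumed interpretation).
first_harvest_h = 9      # hours of growth before the first sample
interval_h = 2           # spacing between successive samples
n_samples = 7

harvest_times_h = [first_harvest_h + i * interval_h for i in range(n_samples)]

# Each array carries approximately 6400 distinct DNA sequences, so seven
# hybridizations give at most this many expression-ratio measurements.
array_elements = 6400
max_ratio_measurements = n_samples * array_elements
```

The last harvest under this reading falls at 21 hours, matching the stated total growth time, and the upper bound of 44,800 is consistent with the "more than 43,000" measurements reported.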

During exponential growth in glucose- 
rich medium, the global pattern of gene 
expression was remarkably stable. Indeed, 
when gene expression patterns between the 
first two cell samples (harvested at a 2-hour 
interval) were compared, mRNA levels dif- 
fered by a factor of 2 or more for only 19 
genes (0.3%), and the largest of these differences 
was only 2.7-fold (14). However, as 
glucose was progressively depleted from the 
growth media during the course of the ex- 
periment, a marked change was seen in the 
global pattern of gene expression. mRNA 
levels for approximately 710 genes were 
induced by a factor of at least 2, and the 
mRNA levels for approximately 1030 genes 
declined by a factor of at least 2. Messenger 
RNA levels for 183 genes increased by a 
factor of at least 4, and mRNA levels for 
203 genes diminished by a factor of at least 
4. About half of these differentially ex- 
pressed genes have no currently recognized 
function and are not yet named. Indeed, 
more than 400 of the differentially expressed 
genes have no apparent homology 
to any gene whose function is known (15). 
The responses of these previously uncharacterized 
genes to the diauxic shift therefore 
provide the first small clue to their possible 
roles. 
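
The counts of induced and repressed genes quoted above come from applying fold-change thresholds to the expression ratios. A minimal Python sketch of that kind of tally follows; the function name `classify`, the gene labels, and the ratios are hypothetical, and the 6400 denominator used for the percentage is the approximate number of array elements given earlier, assumed here to be the basis of the 0.3% figure.

```python
def classify(fold_changes, threshold=2.0):
    """Split genes into induced and repressed sets at a given fold threshold.

    fold_changes maps gene name -> ratio of sample to reference expression.
    A gene is induced if ratio >= threshold, repressed if ratio <= 1/threshold.
    """
    induced = sorted(g for g, r in fold_changes.items() if r >= threshold)
    repressed = sorted(g for g, r in fold_changes.items() if r <= 1.0 / threshold)
    return induced, repressed

# Hypothetical ratios for five genes.
ratios = {"geneA": 4.5, "geneB": 0.2, "geneC": 1.1, "geneD": 2.0, "geneE": 0.5}

induced_2x, repressed_2x = classify(ratios, threshold=2.0)
induced_4x, repressed_4x = classify(ratios, threshold=4.0)

# 19 genes out of roughly 6400 array elements, expressed as a percentage.
percent_changed = round(100.0 * 19 / 6400, 1)
```

Raising the threshold from 2 to 4 shrinks both sets, in the same way that the text's counts fall from about 710 and 1030 genes to 183 and 203.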

The global view of changes in expres- 
sion of genes with known functions pro- 
vides a vivid picture of the way in which 
the cell adapts to a changing environ- 
ment. Figure 3 shows a portion of the yeast 
metabolic pathways involved in carbon 
and energy metabolism. Mapping the 
changes we observed in the mRNAs en- 
coding each enzyme onto this framework 
allowed us to infer the redirection in the 
flow of metabolites through this system. 
We observed large inductions of the genes 
coding for the enzymes aldehyde dehydrogenase 
(ALD2) and acetyl-coenzyme A (CoA) synthase 
(ACS1), which function together to convert the 
products of alcohol dehydrogenase into acetyl-CoA, 
which in turn is used to fuel the tricarboxylic 
acid (TCA) cycle and the glyoxylate 
cycle. The concomitant shutdown of transcription 
of the genes encoding pyruvate 
decarboxylase and induction of pyruvate 
carboxylase rechannels pyruvate away 
from acetaldehyde, and instead to oxalacetate, 
where it can serve to supply the 
TCA cycle and gluconeogenesis. Induction 
of the pivotal genes PCK1, encoding 
phosphoenolpyruvate carboxykinase, and 
FBP1, encoding fructose 1,6-biphosphatase, 
switches the directions of two key 
irreversible steps in glycolysis, reversing 
the flow of metabolites along the reversible 
steps of the glycolytic pathway toward 
the essential biosynthetic precursor, 
glucose-6-phosphate. Induction of the genes 
coding for the trehalose synthase and glycogen 
synthase complexes promotes channeling 
of glucose-6-phosphate into these 
carbohydrate storage pathways. 

Just as the changes in expression of 
genes encoding pivotal enzymes can pro- 
vide insight into metabolic reprogram- 
ming, the behavior of large groups of func- 
tionally related genes can provide a broad 
view of the systematic way in which the 
yeast cell adapts to a changing environ- 
ment (Fig. 4). Several classes of genes, 
such as cytochrome c-related genes and 
those involved in the TCA/glyoxylate cycle 
and carbohydrate storage, were coordinately 
induced by glucose exhaustion. In 
contrast, genes devoted to protein synthe- 
sis, including ribosomal proteins, tRNA 
synthetases, and translation, elongation, 
and initiation factors, exhibited a coordi- 
nated decrease in expression. More than 
95% of ribosomal genes showed at least 
twofold decreases in expression during the 
diauxic shift (Fig. 4) (13). A noteworthy 
and illuminating exception was that the 



genes encoding mitochondrial ribosomal 
genes were generally induced rather than 
repressed after glucose limitation, highlighting 
the requirement for mitochondrial 
biogenesis (13). As more is learned about 
the functions of every gene in the yeast 
genome, the ability to gain insight into a 
cell's response to a changing environment 
through its global gene expression patterns 
will become increasingly powerful. 

Several distinct temporal patterns of ex- 
pression could be recognized, and sets of 
genes could be grouped on the basis of the 
similarities in their expression patterns. The 
characterized members of each of these 
groups also shared important similarities in 
their functions. Moreover, in most cases, 
common regulatory mechanisms could be 
inferred for sets of genes with similar expression 
profiles. For example, seven genes 
showed a late induction profile, with mRNA 
levels increasing by more than ninefold at 
the last timepoint but less than threefold at 
the preceding timepoint (Fig. 5B). All of 
these genes were known to be glucose-re- 
pressed, and five of the seven were previously 
noted to share a common upstream activat- 
ing sequence (UAS), the carbon source re- 
sponse element (CSRE) (16-20). A search 
in the promoter regions of the remaining two 
genes, ACR1 and IDP2, revealed that 
ACR1, a gene essential for ACS1 activity, 
also possessed a consensus CSRE motif, but 
interestingly, IDP2 did not. A search of the 
entire yeast genome sequence for the con- 
sensus CSRE motif revealed only four addi- 
tional candidate genes, none of which 
showed a similar induction. 
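
The "late induction" criterion described above (more than ninefold at the last timepoint, less than threefold at the one before) is easy to express as a filter over expression profiles. The sketch below assumes each profile is a list of fold-induction values, one per timepoint in chronological order; the function name, gene labels, and numbers are hypothetical.

```python
def late_induction_genes(profiles, last_min=9.0, previous_max=3.0):
    """Return genes induced more than last_min-fold at the final timepoint
    but less than previous_max-fold at the preceding one.

    profiles maps gene name -> list of fold-induction values over time.
    """
    return sorted(
        gene
        for gene, values in profiles.items()
        if values[-1] > last_min and values[-2] < previous_max
    )

# Hypothetical seven-point profiles.
profiles = {
    "late_gene": [1.0, 1.1, 1.2, 1.5, 2.0, 2.5, 12.0],
    "early_gene": [1.0, 2.0, 4.0, 8.0, 10.0, 11.0, 12.0],
    "flat_gene": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
}
late = late_induction_genes(profiles)
```

Only the gene that stays low until the final sample passes the filter; a gene that reaches the same final level gradually does not.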

Examples from additional groups of 
genes that shared expression profiles are 
illustrated in Fig. 5, C through F. The 
sequences upstream of the named genes in 
Fig. 5C all contain stress response ele- 
ments (STRE), and with the exception 




Fig. 1, Yeast genome microarray. The actual size of the microarray is 18 mm by 18 mm. The 
microarray was printed as described (9). This image was obtained with the same fluorescent 
scanning confocal microscope used to coilect aii the data we report (49). A fluorescently labeled 
cDNA probe was prepared from mRNA isolated from cells harvested shortly after Inoculation (culture 
density of <5 x 10 6 cells/ml and media glucose level of 19 g/lrter) by reverse transcription in the 
presence of Cy3-dUTP. Similarly, a second probe was prepared from mRNA Isolated from cells taken 
from the same culture 9.5 hours later (culture density of ~2 x 10 8 cells/ml, with a glucose level of 
<0.2 g/iiter) by reverse transcription in the presence of Cy5-dUTP. In this image, hybridization of the 
Cy3-dUTP-labeled cDNA (that is, mRNA expression at the initial timepoint) is represented as a green 
signal, and hybridization of Cy5-dUTP-!abeled cDNA (that Is, mRNA expression at 9.5 hours) is 
represented as a red signal. Thus, genes induced or repressed after the diauxic shift appear in this 
image as red and green spots, respectively. Genes expressed at roughly equal levels before and after 
the diauxic shift appear in this image as yellow spots. 
of HSP42, have previously been shown to
be controlled at least in part by these
elements (21-24). Inspection of the
sequences upstream of HSP42 and the two
uncharacterized genes shown in Fig. 5C,
YKL026c, a hypothetical protein with
similarity to glutathione peroxidase, and
YGR043c, a putative transaldolase,
revealed that each of these genes also
possesses repeated upstream copies of the
stress-responsive CCCCT motif. Of the 13
additional genes in the yeast genome that
shared this expression profile [including
HSP30, ALD2, OM45, and 10
uncharacterized ORFs (25)], nine contained one or
more recognizable STRE sites in their
upstream regions.

The heterotrimeric transcriptional
activator complex HAP2,3,4 has been shown
to be responsible for induction of several 
genes important for respiration (26-28). 
This complex binds a degenerate consensus 
sequence known as the CCAAT box (26). 
Computer analysis, using the consensus se- 
quence TNRYTGGB (29), has suggested 
that a large number of genes involved in 
respiration may be specific targets of 
HAP2,3,4 (30). Indeed, a putative
HAP2,3,4 binding site could be found in
the sequences upstream of each of the seven 
cytochrome c-related genes that showed 
the greatest magnitude of induction (Fig. 
5D). Of 12 additional cytochrome c-related 
genes that were induced, HAP2,3,4 binding
sites were present in all but one. Signifi- 
cantly, we found that transcription of 
HAP4 itself was induced nearly ninefold 
concomitant with the diauxic shift. 
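
A degenerate consensus such as TNRYTGGB can be scanned for by expanding each IUPAC nucleotide code into the set of bases it stands for. The Python sketch below is one way such a search might be done; it is not the computer analysis cited above (29, 30), whose details are not given here. The function names and the example sequence in the note that follows are hypothetical, and scanning both strands is an assumption.

```python
import re

# Standard IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def motif_to_regex(motif):
    """Compile an IUPAC motif; the lookahead lets overlapping hits be reported."""
    body = "".join(IUPAC[base] for base in motif.upper())
    return re.compile("(?=(" + body + "))")


def find_sites(upstream, motif):
    """Return (strand, 0-based start, matched sequence) for hits on either strand.

    Assumption: both strands are scanned. Starts are always given in
    forward-strand coordinates of the supplied upstream sequence.
    """
    upstream = upstream.upper()
    pattern = motif_to_regex(motif)
    hits = [("+", m.start(), m.group(1)) for m in pattern.finditer(upstream)]
    n, k = len(upstream), len(motif)
    for m in pattern.finditer(reverse_complement(upstream)):
        hits.append(("-", n - m.start() - k, m.group(1)))
    return hits
```

For the made-up sequence GGTAACTGGCAA, find_sites("GGTAACTGGCAA", "TNRYTGGB") reports a single forward-strand hit, TAACTGGC, starting at offset 2.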

Control of ribosomal protein biogenesis 
is mainly exerted at the transcriptional 
level, through the presence of a common 
upstream-activating element (UASrpg)
that is recognized by the Rap1 DNA-binding
protein (31, 32). The expression profiles
of seven ribosomal proteins are shown
in Fig. 5F. A search of the sequences
upstream of all seven genes revealed
consensus Rap1-binding motifs (33). It has
been suggested that declining Rap1 levels
in the cell during starvation may be
responsible for the decline in ribosomal
protein gene expression (34). Indeed, we
observed that the abundance of RAP1
mRNA diminished by 4.4-fold, at about
the time of glucose exhaustion. 

Of the 149 genes that encode known or 
putative transcription factors, only two, 
HAP4 and SIP4, were induced by a factor of 
more than threefold at the diauxic shift. 
SIP4 encodes a DNA-binding transcrip- 
tional activator that has been shown to 
interact with Snf1, the "master regulator" of
glucose repression (35). The eightfold
induction of SIP4 upon depletion of glucose
strongly suggests a role in the induction of 



downstream genes at the diauxic shift. 

Although most of the transcriptional 
responses that we observed were not pre- 
viously known, the responses of many 
genes during the diauxic shift have been 
described. Comparison of the results we 
obtained by DNA microarray hybridiza- 
tion with previously reported results there- 
fore provided a strong test of the sensitiv- 
ity and accuracy of this approach. The 
expression patterns we observed for previ- 
ously characterized genes showed almost 
perfect concordance with previously pub- 
lished results (36). Moreover, the differ- 
ential expression measurements obtained 
by DNA microarray hybridization were re- 
producible in duplicate experiments. For 
example, the remarkable changes in gene 
expression between cells harvested imme- 
diately after inoculation and immediately 
after the diauxic shift (the first and sixth 
intervals in this time series) were mea- 
sured in duplicate, independent DNA mi- 
croarray hybridizations. The correlation 
coefficient for two complete sets of expres- 
sion ratio measurements was 0.87, and for 
more than 95% of the genes, the expres- 



sion ratios measured in these duplicate 
experiments differed by less than a factor 
of 2. However, in a few cases, there were 
discrepancies between our results and pre- 
vious results, pointing to technical limita- 
tions that will need to be addressed as 
DNA microarray technology advances 
(37, 38). Despite the noted exceptions, 
the high concordance between the results 
we obtained in these experiments and 
those of previous studies provides confi- 
dence in the reliability and thoroughness 
of the survey. 
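
The two reproducibility figures quoted above (a correlation coefficient across all genes, and the share of genes whose duplicate ratios agree within a factor of 2) can be computed along the following lines. This is a minimal Python sketch under stated assumptions, not the authors' procedure: the text does not say whether the correlation was taken on raw or log-transformed ratios, so the log2 transform here is an assumption, and the function names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def reproducibility(ratios_a, ratios_b):
    """Compare two replicate lists of expression ratios given in the same gene order.

    Returns (correlation of log2 ratios, fraction of genes whose two ratios
    differ by less than a factor of 2). Assumption: ratios are positive and
    the correlation is taken on log2-transformed values.
    """
    if len(ratios_a) != len(ratios_b):
        raise ValueError("replicates must cover the same genes")
    log_a = [math.log2(r) for r in ratios_a]
    log_b = [math.log2(r) for r in ratios_b]
    # A difference of less than 1 in log2 units is less than a twofold difference.
    within_twofold = sum(abs(a - b) < 1.0 for a, b in zip(log_a, log_b))
    return pearson(log_a, log_b), within_twofold / len(log_a)
```

Working in log2 units treats a twofold induction and a twofold repression symmetrically, which is why the twofold test reduces to a difference of less than 1.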

The changes in gene expression during 
this diauxic shift are complex and involve 
integration of many kinds of information 
about the nutritional and metabolic state 
of the cell. The large number of genes
whose expression is altered and the diver- 
sity of temporal expression profiles ob- 
served in this experiment highlight the 
challenge of understanding the underlying 
regulatory mechanisms. One approach to 
defining the contributions of individual 
regulatory genes to a complex program of 
this kind is to use DNA microarrays to 
identify genes whose expression is affected 



Fig. 2. The section of the ar- 
ray indicated by the gray box 
in Fig. 1 is shown for each of 
the experiments described 
here. Representative genes 
are labeled. In each of the ar- 
rays used to analyze gene 
expression during the diauxic 
shift, red spots represent 
genes that were induced rel- 
ative to the initial timepoint, 
and green spots represent 
genes that were repressed 
relative to the initial timepoint. 
In the arrays used to analyze 
the effects of the tup1Δ
mutation and YAP1
overexpression, red spots represent
genes whose expression was
increased, and green spots
represent genes whose
expression was decreased by
the genetic modification. Note 
that distinct sets of genes are 
induced and repressed in the 
different experiments. The 
complete images of each of 
these arrays can be viewed on 
the Internet (13). Cell density
as measured by optical densi- 
ty (OD) at 600 nm was used to 
measure the growth of the 
culture. 
by mutations in each putative regulatory
gene. As a test of this strategy, we analyzed
the genomewide changes in gene expression
that result from deletion of the TUP1 gene.
Transcriptional repression of many genes by
glucose requires the DNA-binding repressor
Mig1 and is mediated by recruiting the
transcriptional co-repressors Tup1 and Cyc8/Ssn6
(39). Tup1 has also been implicated in
repression of oxygen-regulated, mating-type-specific,
and DNA-damage-inducible genes
(40).



Wild-type yeast cells and cells bearing
a deletion of the TUP1 gene (tup1Δ) were
grown in parallel cultures in rich medium
containing glucose as the carbon source.
Messenger RNA was isolated from exponentially
growing cells from the two populations
and used to prepare cDNA labeled
with Cy3 (green) and Cy5 (red),
respectively (11). The labeled probes were
mixed and simultaneously hybridized to
the microarray. Red spots on the microarray
therefore represented genes whose
transcription was induced in the tup1Δ
strain, and thus presumably repressed by
Tup1 (41). A representative section of the
microarray (Fig. 2, bottom middle panel)
illustrates that the genes whose expression
was affected by the tup1Δ mutation were,
in general, distinct from those induced
upon glucose exhaustion [complete images
of all the arrays shown in Fig. 2 are available
on the Internet (13)]. Nevertheless,
34 (10%) of the genes that were induced
by a factor of at least 2 after the diauxic
shift were similarly induced by deletion of
TUP1, suggesting that these genes may be
subject to TUP1-mediated repression by
glucose. For example, SUC2, the gene encoding
invertase, and all five hexose transporter
genes that were induced during the
course of the diauxic shift were similarly
induced, in duplicate experiments, by the
deletion of TUP1.
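
The red/green reading of a two-color hybridization described above amounts to comparing the Cy5 signal (here the tup1Δ strain) with the Cy3 signal (here wild type) at each spot. The Python sketch below illustrates that comparison only; it assumes the two intensities have already been background-corrected and normalized, steps this passage does not describe, and the function name and the default twofold cutoff are illustrative choices.

```python
import math


def classify_spot(cy3, cy5, fold=2.0):
    """Label one spot from its Cy3 (reference) and Cy5 (test) intensities.

    Assumption: intensities are background-corrected and normalized.
    A spot counts as changed when the Cy5/Cy3 ratio reaches `fold` in
    either direction.
    """
    if cy3 <= 0 or cy5 <= 0:
        return "unmeasurable"
    log_ratio = math.log2(cy5 / cy3)
    threshold = math.log2(fold)
    if log_ratio >= threshold:
        return "induced (red)"
    if log_ratio <= -threshold:
        return "repressed (green)"
    return "unchanged (yellow)"
```

With hypothetical intensities, classify_spot(100, 400) returns "induced (red)", the class a Tup1-repressed gene would be expected to fall into in the tup1Δ comparison.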

The set of genes affected by Tup1 in this
experiment also included α-glucosidases,
the mating-type-specific genes MFA1 and
MFA2, and the DNA damage-inducible
RNR2 and RNR4, as well as genes involved
in flocculation and many genes of unknown
function. The hybridization signal corresponding
to expression of TUP1 itself was
also severely reduced because of the (incomplete)
deletion of the transcription unit
in the tup1Δ strain, providing a positive
control in the experiment (42).

Many of the transcriptional targets of
Tup1 fell into sets of genes with related
biochemical functions. For instance, although
only about 3% of all yeast genes
appeared to be TUP1-repressed by a factor
of more than 2 in duplicate experiments
under these conditions, 6 of the 13 genes
that have been implicated in flocculation
(15) showed a reproducible increase in
expression of at least twofold when TUP1
was deleted. Another group of related
genes that appeared to be subject to TUP1
repression encodes the serine-rich cell
wall mannoproteins, such as Tip1 and
Tir1/Srp1, which are induced by cold
shock and other stresses (43), and similar,
serine-poor proteins, the seripauperins
(44). Messenger RNA levels for 23 of the
26 genes in this group were reproducibly
elevated by at least 2.5-fold in the tup1Δ
Fig. 3. Metabolic reprogramming inferred from global analysis of changes in gene expression. Only key
metabolic intermediates are identified. The yeast genes encoding the enzymes that catalyze each step
in this metabolic circuit are identified by name in the boxes. The genes encoding succinyl-CoA synthase
and glycogen-debranching enzyme have not been explicitly identified, but the ORFs YGR244 and
YPR184 show significant homology to known succinyl-CoA synthase and glycogen-debranching
enzymes, respectively, and are therefore included in the corresponding steps in this figure. Red boxes with
white lettering identify genes whose expression increases in the diauxic shift. Green boxes with dark
green lettering identify genes whose expression diminishes in the diauxic shift. The magnitude of
induction or repression is indicated for these genes. For multimeric enzyme complexes, such as
succinate dehydrogenase, the indicated fold-induction represents an unweighted average of all the
genes listed in the box. Black and white boxes indicate no significant differential expression (less than
twofold). The direction of the arrows connecting reversible enzymatic steps indicates the direction of the
flow of metabolic intermediates, inferred from the gene expression pattern, after the diauxic shift. Arrows
representing steps catalyzed by genes whose expression was strongly induced are highlighted in red.
The broad gray arrows represent major increases in the flow of metabolites after the diauxic shift,
inferred from the indicated changes in gene expression.
strain, and 18 of these genes were induced
by more than sevenfold when TUP1 was
deleted. In contrast, none of 83 genes that
could be classified as putative regulators of
the cell division cycle were induced more
than twofold by deletion of TUP1. Thus,
despite the diversity of the regulatory systems
that employ Tup1, most of the genes
that it regulates under these conditions
fall into a limited number of distinct functional
classes.

Because the microarray allows us to
monitor expression of nearly every gene in
yeast, we can, in principle, use this approach
to identify all the transcriptional
targets of a regulatory protein like Tup1. It
is important to note, however, that in any
single experiment of this kind we can only
recognize those target genes that are normally
repressed (or induced) under the
conditions of the experiment. For instance,
the experiment described here analyzed
a MATα strain in which MFA1
and MFA2, the genes encoding the a-factor
mating pheromone precursor, are
normally repressed. In the isogenic tup1Δ
strain, these genes were inappropriately
expressed, reflecting the role that Tup1
plays in their repression. Had we instead
carried out this experiment with a MATa
strain (in which expression of MFA1 and
MFA2 is not repressed), it would not have
been possible to conclude anything regarding
the role of Tup1 in the repression
of these genes. Conversely, we cannot distinguish
indirect effects of the chronic
absence of Tup1 in the mutant strain from
effects directly attributable to its participation
in repressing the transcription of a
gene.

Another simple route to modulating the
activity of a regulatory factor is to overexpress
the gene that encodes it. YAP1 encodes
a DNA-binding transcription factor
belonging to the b-zip class of DNA-binding
proteins. Overexpression of YAP1 in
yeast confers increased resistance to hydrogen
peroxide, o-phenanthroline, heavy
metals, and osmotic stress (45). We analyzed
differential gene expression between a
wild-type strain bearing a control plasmid
and a strain with a plasmid expressing YAP1
under the control of the strong GAL1-10
promoter, both grown in galactose (that is,
a condition that induces YAP1 overexpression).
Complementary DNA from the control
and YAP1 overexpressing strains, labeled
with Cy3 and Cy5, respectively, was
prepared from mRNA isolated from the two
strains and hybridized to the microarray.
Thus, red spots on the array represent genes
that were induced in the strain overexpressing
YAP1.

Of the 17 genes whose mRNA levels
increased by more than threefold when
YAP1 was overexpressed in this way, five
bear homology to aryl-alcohol oxidoreduc- 
tases (Fig. 2 and Table 1). An additional 
four of the genes in this set also belong to 
the general class of dehydrogenases/oxi- 
doreductases. Very little is known about 
the role of aryl-alcohol oxidoreductases in 
S. cerevisiae, but these enzymes have been 
isolated from ligninolytic fungi, in which 
they participate in coupled redox reac- 
tions, oxidizing aromatic, and aliphatic 
unsaturated alcohols to aldehydes with the 
production of hydrogen peroxide (46, 47). 
The fact that a remarkable fraction of the 
targets identified in this experiment be- 
long to the same small, functional group of 
oxidoreductases suggests that these genes 

Fig. 4. Coordinated regulation of functionally related genes. The curves represent the average
induction or repression ratios for all the genes in each indicated group. The total number of
genes in each group was as follows: ribosomal proteins, 112; translation elongation and initiation
factors, 25; tRNA synthetases (excluding mitochondrial synthetases), 17; glycogen and trehalose
synthesis and degradation, 15; cytochrome c oxidase and reductase proteins, 19; and TCA- and
glyoxylate-cycle enzymes, 24.

Table 1. Genes induced by YAP1 overexpression. This list includes all the genes for which mRNA levels
increased by more than twofold upon YAP1 overexpression in both of two duplicate experiments, and
for which the average increase in mRNA level in the two experiments was greater than threefold (50).
Positions of the canonical Yap1 binding sites upstream of the start codon, when present, and the 
average fold-increase in mRNA levels measured in the two experiments are indicated. 



might play an important protective role 
during oxidative stress. Transcription of a 
small number of genes was reduced in the 
strain overexpressing Yap1. Interestingly,
many of these genes encode sugar per- 
meases or enzymes involved in inositol 
metabolism. 
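
The inclusion rule stated in the Table 1 caption (more than twofold in both duplicate experiments, and an average above threefold) can be written as a simple filter. The Python sketch below is illustrative only: the function name and the gene labels in the note that follows are hypothetical, and taking "average" to mean the arithmetic mean of the two fold-increases is an assumption.

```python
def yap1_induced(fold_changes):
    """Select genes by the Table 1 rule and rank them by mean fold-increase.

    fold_changes maps a gene name to a pair of fold-increases, one from
    each duplicate experiment. Assumption: the average is the arithmetic
    mean of the two values.
    """
    selected = {}
    for gene, (rep1, rep2) in fold_changes.items():
        mean_fold = (rep1 + rep2) / 2
        # Both replicates must exceed twofold, and the mean must exceed threefold.
        if rep1 > 2 and rep2 > 2 and mean_fold > 3:
            selected[gene] = mean_fold
    # Highest average induction first.
    return dict(sorted(selected.items(), key=lambda kv: kv[1], reverse=True))
```

Requiring both replicates to pass guards against a gene being listed on the strength of one outlying measurement: a hypothetical gene measured at 1.8-fold and 6.0-fold averages 3.9 but is still excluded.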

We searched for Yap1-binding sites
(TTACTAA or TGACTAA) in the sequences
upstream of the target genes we
identified (48). About two-thirds of the
genes that were induced by more than
threefold upon Yap1 overexpression had
one or more binding sites within 600 bases
upstream of the start codon (Table 1),
suggesting that they are directly regulated by
Yap1. The absence of canonical Yap1-binding

          327    OYE2   NAD(P)H oxidoreductase (old yellow enzyme), isoform 1   4.1
YML131W   507;          Similarity to A. thaliana zeta-crystallin homolog       3.7
YOL126C          MDH2   Malate dehydrogenase                                    3.3

sites upstream of the others may reflect
an ability of Yap1 to bind sites that differ
from the canonical binding sites, perhaps in
cooperation with other factors, or less likely,
may represent an indirect effect of Yap1
overexpression, mediated by one or more
intermediary factors. Yap1 sites were found
only four times in the corresponding region
of an arbitrary set of 30 genes that were not
differentially regulated by Yap1.

Use of a DNA microarray to characterize
the transcriptional consequences of
mutations affecting the activity of regula- 
tory molecules provides a simple and pow- 
erful approach to dissection and character- 
ization of regulatory pathways and net- 



works. This strategy also has an important 
practical application in drug screening. 
Mutations in specific genes encoding can- 
didate drug targets can serve as surrogates 
for the ideal chemical inhibitor or modu- 
lator of their activity. DNA microarrays 
can be used to define the resulting signa- 
ture pattern of alterations in gene expres- 
sion, and then subsequently used in an 
assay to screen for compounds that repro- 
duce the desired signature pattern. 

DNA microarrays provide a simple and 
economical way to explore gene expres- 
sion patterns on a genomic scale. The 
hurdles to extending this approach to any 
other organism are minor. The equipment 



required for fabricating and using DNA 
microarrays (9) consists of components 
that were chosen for their modest cost and 
simplicity. It was feasible for a small group
to accomplish the amplification of more
than 6000 genes in about 4 months and,
once the amplified gene sequences were in
hand, only 2 days were required to print a
set of 110 microarrays of 6400 elements
each. Probe preparation, hybridization,
and fluorescent imaging are also simple 
procedures. Even conceptually simple ex- 
periments, as we described here, can yield 
vast amounts of information. The value of 
the information from each experiment of 
this kind will progressively increase as 
more is learned about the functions of 
each gene and as additional experiments 
define the global changes in gene expres- 
sion in diverse other natural processes and 
genetic perturbations. Perhaps the greatest 
challenge now is to develop efficient 
methods for organizing, distributing,
interpreting, and extracting insights from the
large volumes of data these experiments 
will provide. 
Fig. 5. Distinct temporal patterns of induction or repression help to group genes that share regulatory
properties. (A) Temporal profile of the cell density, as measured by OD at 600 nm and glucose
concentration in the media. (B) Seven genes exhibited a strong induction (greater than ninefold) only at
the last timepoint (20.5 hours). With the exception of IDP2, each of these genes has a CSRE UAS. There
were no additional genes observed to match this profile. (C) Seven members of a class of genes marked
by early induction with a peak in mRNA levels at 18.5 hours. Each of these genes contains STRE motif
repeats in its upstream promoter region. (D) Cytochrome c oxidase and ubiquinol cytochrome c
reductase genes. Marked by an induction coincident with the diauxic shift, each of these genes contains
a consensus binding motif for the HAP2,3,4 protein complex. At least 17 genes shared a similar
expression profile. (E) SAM1, GPP1, and several genes of unknown function are repressed before the
diauxic shift, and continue to be repressed upon entry into stationary phase. (F) Ribosomal protein
genes comprise a large class of genes that are repressed upon depletion of glucose. Each of the genes
profiled here contains one or more RAP1-binding motifs upstream of its promoter. RAP1 is a
transcriptional regulator of most ribosomal proteins.